MLN repo autocleanup
# Azure Machine Learning Data Prep SDK

The Azure Machine Learning Data Prep SDK helps data scientists explore, cleanse, and transform data for machine learning workflows in any Python environment.

Key benefits of the SDK:

- Cross-platform functionality. Write with a single SDK and run it on Windows, macOS, or Linux.
- Intelligent transformations powered by AI, including grouping similar values to their canonical form and deriving columns by example without custom code.
- Capability to work with multiple large files of different schemas.
- Scalability on a single machine by streaming data during processing rather than loading it into memory.
- Seamless integration with other Azure Machine Learning services. You can simply pass your prepared data file into an `AutoMLConfig` object for automated machine learning training.

In this repo you will find:

- [Getting Started Tutorial](tutorials/getting-started/getting-started.ipynb) for a quick introduction to the main features of the Data Prep SDK.
- [Case Study Notebooks](case-studies/new-york-taxi) that present an end-to-end data preparation tutorial: users start with a small dataset, profile it with summary statistics, cleanse it, and perform feature engineering. All transformation steps are saved in a dataflow object, so users can easily reapply the same steps to the full dataset and run it on Spark.
- [How-To Guide Notebooks](how-to-guides) for more in-depth sample code at the feature level.

## Installation

Here are the [SDK installation steps](https://aka.ms/aml-data-prep-installation).

## Documentation

Here is more information on how to use the new Data Prep SDK:

- [SDK overview and API reference docs](http://aka.ms/data-prep-sdk) that show the different classes, methods, and function parameters of the SDK.
- [Tutorial: Prep NYC taxi data](https://docs.microsoft.com/azure/machine-learning/service/tutorial-data-prep) for regression modeling, then run automated machine learning to build the model.
- [How to load data](https://docs.microsoft.com/azure/machine-learning/service/how-to-load-data) is an overview guide on loading data with the Data Prep SDK.
- [How to transform data](https://docs.microsoft.com/azure/machine-learning/service/how-to-transform-data) is an overview guide on transforming data.
- [How to write data](https://docs.microsoft.com/azure/machine-learning/service/how-to-write-data) is an overview guide on writing data to different storage locations.

## Support

If you have any questions or feedback, send us an email at [askamldataprep@microsoft.com](mailto:askamldataprep@microsoft.com).

## Release Notes
### 2019-05-28 (version 1.1.4)

New features

- You can now use the following expression language functions to extract and parse datetime values into new columns.
  - `RegEx.extract_record()` extracts datetime elements into a new column.
  - `create_datetime()` creates datetime objects from separate datetime elements.
- When calling `get_profile()`, you can now see that quantile columns are labeled as (est.) to clearly indicate that the values are approximations.
- You can now use `**` globbing when reading from Azure Blob Storage.
  - e.g. `dprep.read_csv(path='https://yourblob.blob.core.windows.net/yourcontainer/**/data/*.csv')`
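For intuition about what a `**` pattern matches, Python's own `glob` module with `recursive=True` uses the same convention: `**` matches any number of nested directories, including none. A minimal local sketch (the SDK applies the same idea to blob paths; this is not the SDK's matcher):

```python
import glob
import os
import tempfile

# Build a small directory tree with CSV files at different nesting depths.
root = tempfile.mkdtemp()
for sub in ("2018/data", "2019/q1/data"):
    os.makedirs(os.path.join(root, sub))
    with open(os.path.join(root, sub, "part.csv"), "w") as f:
        f.write("a,b\n1,2\n")

# "**" spans any number of intermediate directories, so both files match.
matches = sorted(glob.glob(os.path.join(root, "**", "data", "*.csv"), recursive=True))
print(len(matches))  # 2
```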
Bug fixes

- Fixed a bug related to reading a Parquet file from a remote source (Azure Blob).

### 2019-05-08 (version 1.1.3)

New features

- Added support for reading from a PostgreSQL database, either by calling `read_postgresql` or by using a Datastore.
  - See examples in the how-to guides:
    - [Data Ingestion notebook](https://aka.ms/aml-data-prep-ingestion-nb)
    - [Datastore notebook](https://aka.ms/aml-data-prep-datastore-nb)

Bug fixes and improvements

- Fixed issues with column type conversion:
  - A boolean or numeric column is now correctly converted to a boolean column.
  - Setting a date column to date type no longer fails.
- Improved the JoinType types and accompanying reference documentation. When joining two dataflows, you can now specify one of these types of join:
  - NONE, MATCH, INNER, UNMATCHLEFT, LEFTANTI, LEFTOUTER, UNMATCHRIGHT, RIGHTANTI, RIGHTOUTER, FULLANTI, FULL.
- Improved data type inference to recognize more date formats.
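To make the less common join names concrete, here is a pure-Python sketch of what three of the join types listed above keep (INNER, LEFTANTI, LEFTOUTER). This illustrates the semantics only; it is not the SDK's implementation, and the `join` helper here is hypothetical:

```python
# Toy join over lists of dicts, keyed on a single column.
def join(left, right, key, how):
    right_keys = {r[key] for r in right}
    if how == "INNER":
        # Only rows whose key appears on both sides.
        return [{**l, **r} for l in left for r in right if l[key] == r[key]]
    if how == "LEFTANTI":
        # Left rows with NO match on the right.
        return [dict(l) for l in left if l[key] not in right_keys]
    if how == "LEFTOUTER":
        # Every left row; right columns filled in where a match exists.
        out = []
        for l in left:
            matched = [r for r in right if r[key] == l[key]]
            if matched:
                out.extend({**l, **r} for r in matched)
            else:
                out.append(dict(l))
        return out
    raise ValueError("unsupported join type: " + how)

left = [{"id": 1, "city": "NYC"}, {"id": 2, "city": "SEA"}]
right = [{"id": 1, "fare": 9.5}]
print(join(left, right, "id", "LEFTANTI"))  # [{'id': 2, 'city': 'SEA'}]
```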
### 2019-04-17 (version 1.1.2)

Note: The Data Prep Python SDK no longer installs the `numpy` and `pandas` packages. See the [updated installation instructions](https://aka.ms/aml-data-prep-installation).

New features

- You can now use the Pivot transform.
  - How-to guide: [Pivot notebook](https://aka.ms/aml-data-prep-pivot-nb)
- You can now use regular expressions in native functions.
  - Examples:
    - `dflow.filter(dprep.RegEx('pattern').is_match(dflow['column_name']))`
    - `dflow.assert_value('column_name', dprep.RegEx('pattern').is_match(dprep.value))`
- You can now use the `to_upper` and `to_lower` functions in the expression language.
- You can now see the number of unique values in each column in a data profile.
- For some of the commonly used reader steps, you can now pass in the `infer_column_types` argument. If it is set to `True`, Data Prep attempts to detect and automatically convert column types.
  - `inference_arguments` is now deprecated.
- You can now call `Dataflow.shape`.
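Conceptually, column type inference of the kind `infer_column_types` performs tries progressively more specific parses over a column's values and converts only when every value agrees. The sketch below shows that idea in plain Python; the SDK's actual inference is more sophisticated, and `infer_and_convert` is a hypothetical name:

```python
from datetime import datetime

def infer_and_convert(values):
    """Try int, then float, then ISO date; fall back to strings."""
    def try_all(parse):
        try:
            return [parse(v) for v in values]
        except (ValueError, TypeError):
            return None  # at least one value did not parse

    for parse in (int, float, lambda v: datetime.strptime(v, "%Y-%m-%d")):
        converted = try_all(parse)
        if converted is not None:
            return converted
    return list(values)  # leave as strings

print(infer_and_convert(["1", "2", "3"]))  # [1, 2, 3]
print(infer_and_convert(["1.5", "2"]))     # [1.5, 2.0]
```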
Bug fixes and improvements

- `keep_columns` now accepts an additional optional argument, `validate_column_exists`, which checks whether the result of `keep_columns` will contain any columns.
- All reader steps that read from a file now accept an additional optional argument, `verify_exists`.
- Improved the performance of reading from a pandas DataFrame and of getting data profiles.
- Fixed a bug where slicing a single step from a Dataflow failed with a single index.

### 2019-04-08 (version 1.1.1)

New features

- You can read multiple Datastore/DataPath/DataReference sources using `read_*` transforms.
- You can perform the following operations on columns to create a new column: division, floor, modulo, power, length.
- Data Prep is now part of the Azure ML diagnostics suite and logs diagnostic information by default.
  - To turn this off, set the environment variable `DISABLE_DPREP_LOGGER` to true.

Bug fixes and improvements

- Improved code documentation for commonly used classes and functions.
- Fixed a bug in `auto_read_file` that failed to read Excel files.
- Added an option to overwrite the folder in `read_pandas_dataframe`.
- Improved the performance of installing the `dotnetcore2` dependency, and added support for Fedora 27/28 and Ubuntu 18.04.
- Improved the performance of reading from Azure Blobs.
- Column type detection now supports columns of type Long.
- Fixed a bug where some date values were displayed as timestamps instead of Python datetime objects.
- Fixed a bug where some type counts were displayed as doubles instead of integers.

### 2019-03-25 (version 1.1.0)

Breaking changes

- The concept of the Data Prep Package has been deprecated and is no longer supported. Instead of persisting multiple Dataflows in one Package, you can persist Dataflows individually.
  - How-to guide: [Opening and Saving Dataflows notebook](https://aka.ms/aml-data-prep-open-save-dataflows-nb)

New features

- Data Prep can now recognize columns that match a particular Semantic Type and split them accordingly. The STypes currently supported include email address, geographic coordinates (latitude and longitude), IPv4 and IPv6 addresses, US phone number, and US zip code.
  - How-to guide: [Semantic Types notebook](https://aka.ms/aml-data-prep-semantic-types-nb)
- Data Prep now supports the following operations to generate a resultant column from two numeric columns: subtract, multiply, divide, and modulo.
- You can call `verify_has_data()` on a Dataflow to check whether the Dataflow would produce records if executed.
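The essence of Semantic Type recognition is checking whether every value in a column fits a known pattern. The toy recognizers below cover two of the STypes listed above (email address, US zip code) with simple regexes; the SDK's real detection is far more robust, and `detect_stype` is an illustrative name only:

```python
import re

# Deliberately simple patterns -- real SType detection handles many more cases.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")

def detect_stype(column_values):
    """Return the name of the first semantic type every value matches, else None."""
    if all(EMAIL.match(v) for v in column_values):
        return "email address"
    if all(US_ZIP.match(v) for v in column_values):
        return "US zip code"
    return None

print(detect_stype(["a@b.com", "c@d.org"]))   # email address
print(detect_stype(["10001", "98101-0001"]))  # US zip code
print(detect_stype(["hello"]))                # None
```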
Bug fixes and improvements

- You can now specify the number of bins to use in a histogram for numeric column profiles.
- The `read_pandas_dataframe` transform now requires the DataFrame to have string- or byte-typed column names.
- Fixed a bug in the `fill_nulls` transform where values were not correctly filled in if the column was missing.

### 2019-03-11 (version 1.0.17)

New features

- Adding two numeric columns to generate a resultant column using the expression language is now supported.

Bug fixes and improvements

- Improved the documentation and parameter checking for `random_split`.
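Row-wise random splitting of the kind `random_split` performs can be sketched in a few lines: each row independently lands in the first split with the given probability, and a seed makes the split reproducible. The function below is an illustration of the semantics, not the SDK's API:

```python
import random

def random_split(rows, percentage, seed=None):
    """Assign each row to the first split with probability `percentage`."""
    rng = random.Random(seed)  # seeded RNG => reproducible split
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second

train, test = random_split(list(range(1000)), 0.8, seed=42)
print(len(train) + len(test))  # 1000 -- every row lands in exactly one split
```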
### 2019-02-27 (version 1.0.16)

Bug fix

- Fixed a Service Principal authentication issue that was caused by an API change.

### 2019-02-25 (version 1.0.15)

New features

- Data Prep now supports writing file streams from a dataflow, and provides the ability to manipulate the file stream names to create new file names.
  - How-to guide: [Working With File Streams notebook](https://aka.ms/aml-data-prep-file-stream-nb)

Bug fixes and improvements

- Improved the performance of t-Digest on large data sets.
- Data Prep now supports reading data from a DataPath.
- One-hot encoding now works on boolean and numeric columns.
- Other miscellaneous bug fixes.

### 2019-02-11 (version 1.0.12)

New features

- Data Prep now supports reading from an Azure SQL database using a Datastore.

Changes

- Significantly improved the memory performance of certain operations on large data.
- `read_pandas_dataframe()` now requires `temp_folder` to be specified.
- The `name` property on `ColumnProfile` has been deprecated; use `column_name` instead.

### 2019-01-28 (version 1.0.8)

Bug fixes

- Significantly improved the performance of getting data profiles.
- Fixed minor bugs related to error reporting.

### 2019-01-14 (version 1.0.7)

New features

- Datastore improvements (documented in the [Datastore how-to guide](https://aka.ms/aml-data-prep-datastore-nb)):
  - Added the ability to read from and write to Azure File Share and ADLS Datastores in scale-up.
  - When using Datastores, Data Prep now supports service principal authentication instead of interactive authentication.
  - Added support for wasb and wasbs URLs.

### 2019-01-09 (version 1.0.6)

Bug fixes

- Fixed a bug with reading from publicly readable Azure Blob containers on Spark.

### 2018-12-19 (version 1.0.4)

New features

- The `to_bool` function now allows mismatched values to be converted to Error values. This is the new default mismatch behavior for `to_bool` and `set_column_types`; the previous default behavior was to convert mismatched values to False.
- When calling `to_pandas_dataframe`, there is a new option to interpret null/missing values in numeric columns as NaN.
- Added the ability to check the return type of some expressions, to ensure type consistency and fail early.
- You can now call `parse_json` to parse values in a column as JSON objects and expand them into multiple columns.
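Conceptually, expanding a JSON column means parsing each value and turning the resulting object's keys into new columns. The pure-Python sketch below illustrates that idea; it is not the SDK's implementation, and `parse_json_column` is a hypothetical helper:

```python
import json

def parse_json_column(rows, column):
    """Parse `column` as JSON in each row and expand its keys into new columns."""
    out = []
    for row in rows:
        parsed = json.loads(row[column])
        # Keep the other columns, drop the raw JSON column...
        expanded = {k: v for k, v in row.items() if k != column}
        # ...and add one prefixed column per key of the parsed object.
        expanded.update({f"{column}.{k}": v for k, v in parsed.items()})
        out.append(expanded)
    return out

rows = [{"id": 1, "payload": '{"fare": 9.5, "tip": 1.0}'}]
print(parse_json_column(rows, "payload"))
# [{'id': 1, 'payload.fare': 9.5, 'payload.tip': 1.0}]
```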
Bug fixes

- Fixed a bug that crashed `set_column_types` in Python 3.5.2.
- Fixed a crash when connecting to a Datastore using an AML image.

### 2018-12-07 (version 0.5.3)

Fixed a missing dependency issue for .NET Core 2 on Ubuntu 16.

### 2018-12-03 (version 0.5.2)

Breaking changes

- `SummaryFunction.N` was renamed to `SummaryFunction.Count`.

Bug fixes

- The latest AML Run Token is now used when reading from and writing to datastores on remote runs. Previously, if the AML Run Token was updated in Python, the Data Prep runtime was not updated with it.
- Additional, clearer error messages.
- `to_spark_dataframe()` no longer crashes when Spark uses Kryo serialization.
- The Value Count Inspector can now show more than 1000 unique values.
- Random Split no longer fails if the original Dataflow doesn't have a name.

### 2018-11-19 (version 0.5.0)

New features

- Created a new DataPrep CLI to execute DataPrep packages and view the data profile for a dataset or dataflow.
- Redesigned the SetColumnType API to improve usability.
- Renamed `smart_read_file` to `auto_read_file`.
- The Data Profile now includes skew and kurtosis.
- Can sample with stratified sampling.
- Can read from zip files that contain CSV files.
- Can split datasets row-wise with Random Split (e.g. into train and test sets).
- Can get all the column data types from a dataflow or a data profile by calling `.dtypes`.
- Can get the row count from a dataflow or a data profile by calling `.row_count`.

Bug fixes

- Fixed long-to-double conversion.
- Fixed an assert after any add-column operation.
- Fixed an issue with FuzzyGrouping where it would not detect groups in some cases.
- Fixed the sort function to respect multi-column sort order.
- Fixed and/or expressions to behave similarly to how pandas handles them.
- Fixed reading from a DBFS path.
- Made error messages more understandable.
- No longer fails when reading on a remote compute target using an AML token.
- No longer fails on the Linux DSVM.
- No longer crashes when non-string values appear in string predicates.
- Assertion errors are now handled correctly when a Dataflow should fail.
- Now supports dbutils-mounted storage locations on Azure Databricks.

### 2018-11-05 (version 0.4.0)

New features

- Type Count added to the Data Profile.
- Value Count and Histogram are now available.
- More percentiles in the Data Profile.
- The median is available in Summarize.
- Python 3.7 is now supported.
- When you save a dataflow that contains datastores to a Data Prep package, the datastore information is persisted as part of the package.
- Writing to a datastore is now supported.

Bug fixes

- 64-bit unsigned integer overflows are now handled properly on Linux.
- Fixed an incorrect text label for plain text files in `smart_read`.
- The String column type now shows up in the metrics view.
- Type Count now shows ValueKinds mapped to a single FieldType instead of individual ones.
- `write_to_csv` no longer fails when the path is provided as a string.
- When using Replace, leaving "find" blank no longer fails.

## Datasets License Information

IMPORTANT: Please read the notice and find out more about this NYC Taxi and Limousine Commission dataset here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

IMPORTANT: Please read the notice and find out more about this Chicago Police Department dataset here: https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
File diff suppressed because it is too large
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scale-Out Data Preparation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we are done with preparing and featurizing the data locally, we can run the same steps on the full dataset in scale-out mode. The New York taxi cab data is about 300GB in total, which is perfect for scale-out. Let's start by downloading the package we saved earlier to disk. Feel free to run the `new_york_taxi_cab.ipynb` notebook to generate the package yourself, in which case you may comment out the download code and set the `package_path` to where the package is saved."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tempfile import mkdtemp\n",
    "from os import path\n",
    "from urllib.request import urlretrieve\n",
    "\n",
    "dflow_root = mkdtemp()\n",
    "dflow_path = path.join(dflow_root, \"new_york_taxi.dprep\")\n",
    "print(\"Downloading Dataflow to: {}\".format(dflow_path))\n",
    "urlretrieve(\"https://dprepdata.blob.core.windows.net/demo/new_york_taxi_v2.dprep\", dflow_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the Dataflow we just downloaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep\n",
    "\n",
    "df = dprep.Dataflow.open(dflow_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's replace the datasources with the full dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from uuid import uuid4\n",
    "\n",
    "other_step = df._get_steps()[7].arguments['dataflows'][0]['anonymousSteps'][0]\n",
    "other_step['id'] = str(uuid4())\n",
    "other_step['arguments']['path']['target'] = 1\n",
    "other_step['arguments']['path']['resourceDetails'][0]['path'] = 'https://wranglewestus.blob.core.windows.net/nyctaxi/yellow_tripdata*'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "green_dsource = dprep.BlobDataSource(\"https://wranglewestus.blob.core.windows.net/nyctaxi/green_tripdata*\")\n",
    "df = df.replace_datasource(green_dsource)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once we have replaced the datasource, we can now run the same steps on the full dataset. We will print the first 5 rows of the spark DataFrame. Since we are running on the full dataset, this might take a little while depending on your spark cluster's size."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"spark_df = df.to_spark_dataframe()\n",
|
||||
"spark_df.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"skip_execute_as_test": true
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
@@ -1,45 +1,45 @@
-----BEGIN PRIVATE KEY-----
MIIEvwIBADANBgkqhkiG9w0BAQEFAASCBKkwggSlAgEAAoIBAQDmkkyF0BwipZow
Wd1AMkRkySx0y079JPxpsYhv4i1xXKdoa9bpFqwoXmJpeQM1JWnU4UeZzFeM86qK
AhQvL4KV4kibcP2ENvu2NKFEdotO3uxPJ+6GlcYwMYzy+tUj008KnnRZfTrR78sJ
tIl3C6lnVL0ICihksG59P1sskRq3PvOjXLAdEZalwDjZ4ZPoNDZdj6nUjB2l8zqu
pKAt5mR+bJ9Sox4yrDuNhMmFt5QsRDRe3wUqdV+C9OCWHmjlmsjrYw7p9YmjBDvC
5U7mF0Mk/XeYFzj0pkXKQVqBL6xqig+q5ob0szYfg19iDeFhS3iIsRcJGEnRVW/A
NpsBZyKrAgMBAAECggEBANlvP8C1F8NInhZYuIAwpzTQTh86Fxw8g9h8dijkh2wv
LyQXBk07d1B+aZoDZ5X32UzKwcX04N9obfvFqBkzWZdVFJmZvUmwvEEActBoZkkT
io+/HX5HweVy5PPCvbsSK6jc8uXtZcnSs4tMeJIOKkvqqnTpd1w00Y1FcQqfMC16
4p7o8wbt6OFoFAYqcxeVYVwDzCTLZD3+iJaqmntkBkoDndJy52yXQmMq5z1wbQVp
BL6+L9nTvmouy64jiHVSKOx8nnWThYfHsXoPv+rYywjeuK/v3hyaTAwogs36ooEn
SnuTBRvJcumN9Q0XIVlxKMVBcGyyAP+0yNKGz5NQgdECgYEA/I/Uq1E3epPJgEWR
Bub+LpCgwtrw/lgKncb/Q/AiE9qoXobUe4KNU8aGaNMb7uVNLckY7cOluLS6SQb3
Mzwk2Jl0G3vk8rW46tZWvSYB8+zAR2Rz7seUOT9SE5OmvwpnHrnp3nRr1vvVd2bp
Q/ypwMLrwWQN51Kr+oTS74bUbrkCgYEA6bXVIUyao7z2Q3qAr6h+6JEWDbkJA7hJ
BjHIOXvxd1tMoJJX+X9+IE/2XoJaUkGCb0vrM/hi1cyQFmS4Or/J6IWSZu8oBpDr
EBmIK3PF1nrzNvWD28wM46c6ScehyWSm/u4bJWSm9liTX3dv5Kpa6ym7yLKc3c0B
ECpSJM+5SoMCgYEAq585Tukzn/IJPUcIk/4nv5C8DW0l0lAVdr2g/JOTNJajTwik
HwHJ86G1+Elsc9wRpAlBDWCjnm4BIFrBZGl8SEuOoJaCL4PZEotwCbxoG09IIbtb
JGkuifBDX9Y3ux3gkPqYt3e5SC99EVQ3MuHgoIJUHehVolmFUAkuJWIjvNECgYEA
5pU0VspRuELzZdgzpxvDOooLDDcHodfslGQBfFXBA1Xc4IACtHMJaa/7D3vkyUtA
+bYZtQjX2sEdWDq/WZdoCjXfIBfNkczhXt0R8G0lQFvGIu9QzUchYGrZo3mHMkBQ
Uy1xMw9/e4YgwQwCJcW+Nk7Sq00uX9enuN9IdHFOCykCgYAqAGMK6CH1tlpjvHrf
k+ZhigYxTXBlsVVvK1BIGGaiwzDpn65zeQp4aLOjSZkI1LuRi3tfTiZ321jRd64J
4lGk5Jurqv5grDmxROX/U50wEYbI9ncu/thU7syUdxDiqxHPI2RMG50mRcm3a55p
ZCNSqkMlcXyA0U1z8C1ILNUsbA==
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIICoTCCAYkCAgPoMA0GCSqGSIb3DQEBBQUAMBQxEjAQBgNVBAMMCUNMSS1Mb2dp
bjAiGA8yMDE4MDcxMzIzMjA0N1oYDzIwMTkwNzEzMjMyMDQ5WjAUMRIwEAYDVQQD
DAlDTEktTG9naW4wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDmkkyF
0BwipZowWd1AMkRkySx0y079JPxpsYhv4i1xXKdoa9bpFqwoXmJpeQM1JWnU4UeZ
zFeM86qKAhQvL4KV4kibcP2ENvu2NKFEdotO3uxPJ+6GlcYwMYzy+tUj008KnnRZ
fTrR78sJtIl3C6lnVL0ICihksG59P1sskRq3PvOjXLAdEZalwDjZ4ZPoNDZdj6nU
jB2l8zqupKAt5mR+bJ9Sox4yrDuNhMmFt5QsRDRe3wUqdV+C9OCWHmjlmsjrYw7p
9YmjBDvC5U7mF0Mk/XeYFzj0pkXKQVqBL6xqig+q5ob0szYfg19iDeFhS3iIsRcJ
GEnRVW/ANpsBZyKrAgMBAAEwDQYJKoZIhvcNAQEFBQADggEBAI4VlaFb9NsXMLdT
Cw5/pk0Xo2Qi6483RGTy8vzrw88IE7f3juB/JWG+rayjtW5bBRx2fae4/ZIdZ4zg
N2FDKn2PQPAc9m9pcKyUKUvWOC8ixSkrUmeQew0l1AXU0hsPSlJ7/7ZK4efoyB47
hj71fsyKdyKbisZDcUFBq/S8PazdPF0YOD1W/4A2tW0cSMg+jmFWynuUTdWt3SU8
CwBGqdiSKT5faJuYwIWnRXDEQS3ObRn1OFEfFdd4d2sxjxydWKRgnINnGlBdiFAT
KzCozVr+75cO2ErH6x5C0hLQGG5BxXbaijyxyvaRNokTMVVv6OaDEnjzCGfJ72Yf
2wgitNc=
-----END CERTIFICATE-----
@@ -1,54 +1,54 @@
"Retrieved from https://en.wikipedia.org/wiki/Chicago_City_Council on November 6, 2018"

Ward,Name,Took Office,Party
1,Proco Joe Moreno,2010*,Dem
2,Brian Hopkins,2015,Dem
3,Pat Dowell,2007,Dem
4,Sophia King,2016*,Dem
5,Leslie Hairston,1999,Dem
6,Roderick Sawyer,2011,Dem
7,Gregory Mitchell,2015,Dem
8,Michelle A. Harris,2006*,Dem
9,Anthony Beale,1999,Dem
10,Susie Sadlowski Garza,2015,Dem
11,Patrick Daley Thompson,2015,Dem
12,George Cardenas,2003,Dem
13,Marty Quinn,2011,Dem
14,Edward M. Burke,1969,Dem
15,Raymond Lopez,2015,Dem
16,Toni Foulkes,2007,Dem
17,David H. Moore,2015,Dem
18,Derrick Curtis,2015,Dem
19,Matthew O'Shea,2011,Dem
20,Willie Cochran,2007,Dem
21,Howard Brookins Jr.,2003,Dem
22,Ricardo Muñoz,1993*,Dem
23,Silvana Tabares,2018*,Dem
24,"Michael Scott, Jr.",2015,Dem
25,Daniel Solis,1996*,Dem
26,Roberto Maldonado,2009*,Dem
27,"Walter Burnett, Jr.",1995,Dem
28,Jason Ervin,2011*,Dem
29,Chris Taliaferro,2015,Dem
30,Ariel Reboyras,2003,Dem
31,Milly Santiago,2015,Dem
32,Scott Waguespack,2007,Dem
33,Deb Mell,2013*,Dem
34,Carrie Austin,1994*,Dem
35,Carlos Ramirez-Rosa,2015,Dem
36,Gilbert Villegas,2015,Dem
37,Emma Mitts,2000*,Dem
38,Nicholas Sposato,2011,Ind
39,Margaret Laurino,1994*,Dem
40,Patrick J. O'Connor,1983,Dem
41,Anthony Napolitano,2015,Rep
42,Brendan Reilly,2007,Dem
43,Michele Smith,2011,Dem
44,Thomas M. Tunney,2002*,Dem
45,John Arena,2011,Dem
46,James Cappleman,2011,Dem
47,Ameya Pawar,2011,Dem
48,Harry Osterman,2011,Dem
49,Joe Moore,1991,Dem
50,Debra Silverstein,2011,Dem
@@ -1,15 +1,15 @@
File updated 11/2/2018

ID|Case Number|Date|Block|IUCR|Primary Type|Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|Updated On|Latitude|Longitude|Location
10140490|HY329907|07/05/2015 11:50:00 PM|050XX N NEWLAND AVE|0820|THEFT|$500 AND UNDER|STREET|false|false|1613|016|41|10|06|1129230|1933315|2015|07/12/2015 12:42:46 PM|41.973309466|-87.800174996|(41.973309466, -87.800174996)
10139776|HY329265|07/05/2015 11:30:00 PM|011XX W MORSE AVE|0460|BATTERY|SIMPLE|STREET|false|true|2431|024|49|1|08B|1167370|1946271|2015|07/12/2015 12:42:46 PM|42.008124017|-87.65955018|(42.008124017, -87.65955018)
10140270|HY329253|07/05/2015 11:20:00 PM|121XX S FRONT AVE|0486|BATTERY|DOMESTIC BATTERY SIMPLE|STREET|false|true|0532||9|53|08B|||2015|07/12/2015 12:42:46 PM|||
10139885|HY329308|07/05/2015 11:19:00 PM|051XX W DIVISION ST|0610|BURGLARY|FORCIBLE ENTRY|SMALL RETAIL STORE|false|false|1531|015|37|25|05|1141721|1907465|2015|07/12/2015 12:42:46 PM|41.902152027|-87.754883404|(41.902152027, -87.754883404)
10140379|HY329556|07/05/2015 11:00:00 PM|012XX W LAKE ST|0930|MOTOR VEHICLE THEFT|THEFT/RECOVERY: AUTOMOBILE|STREET|false|false|1215|012|27|28|07|1168413|1901632|2015|07/12/2015 12:42:46 PM|41.885610142|-87.657008701|(41.885610142, -87.657008701)
10140868|HY330421|07/05/2015 10:54:00 PM|118XX S PEORIA ST|1320|CRIMINAL DAMAGE|TO VEHICLE|VEHICLE NON-COMMERCIAL|false|false|0524|005|34|53|14|1172409|1826485|2015|07/12/2015 12:42:46 PM|41.6793109|-87.644545209|(41.6793109, -87.644545209)
10139762|HY329232|07/05/2015 10:42:00 PM|026XX W 37TH PL|1020|ARSON|BY FIRE|VACANT LOT/LAND|false|false|0911|009|12|58|09|1159436|1879658|2015|07/12/2015 12:42:46 PM|41.825500607|-87.690578042|(41.825500607, -87.690578042)
10139722|HY329228|07/05/2015 10:30:00 PM|016XX S CENTRAL PARK AVE|1811|NARCOTICS|POSS: CANNABIS 30GMS OR LESS|ALLEY|true|false|1021|010|24|29|18|1152687|1891389|2015|07/12/2015 12:42:46 PM|41.857827814|-87.715028789|(41.857827814, -87.715028789)
10139774|HY329209|07/05/2015 10:15:00 PM|048XX N ASHLAND AVE|1310|CRIMINAL DAMAGE|TO PROPERTY|APARTMENT|false|false|2032|020|46|3|14|1164821|1932394|2015|07/12/2015 12:42:46 PM|41.970099796|-87.669324377|(41.970099796, -87.669324377)
10139697|HY329177|07/05/2015 10:10:00 PM|058XX S ARTESIAN AVE|1320|CRIMINAL DAMAGE|TO VEHICLE|ALLEY|false|false|0824|008|16|63|14|1160997|1865851|2015|07/12/2015 12:42:46 PM|41.787580282|-87.685233078|(41.787580282, -87.685233078)
@@ -1,11 +1,11 @@
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
10498554,HZ239907,4/4/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,FALSE,FALSE,531,5,9,50,11,1183356,1831503,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,FALSE,FALSE,614,6,21,71,6,1166776,1850053,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,FALSE,FALSE,2211,22,19,74,11,,,2016,5/12/2016 15:50,,,
10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,FALSE,FALSE,531,5,9,49,10,,,2016,5/13/2016 15:51,,,
10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",FALSE,FALSE,1712,17,40,13,6,,,2016,5/25/2016 15:59,,,
10535059,HZ278872,4/15/2016 4:30,004XX S KILBOURN AVE,810,THEFT,OVER $500,RESIDENCE,FALSE,FALSE,1131,11,24,26,6,,,2016,5/25/2016 15:59,,,
10499802,HZ240778,4/15/2016 10:00,010XX N MILWAUKEE AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,RESIDENCE,FALSE,FALSE,1213,12,27,24,11,,,2016,5/27/2016 15:45,,,
10522293,HZ264802,4/15/2016 16:00,019XX W DIVISION ST,1110,DECEPTIVE PRACTICE,BOGUS CHECK,RESTAURANT,FALSE,FALSE,1424,14,1,24,11,1163094,1908003,2016,5/16/2016 15:48,41.90320604,-87.67636193,"(41.903206037, -87.676361925)"
10523111,HZ265911,4/15/2016 8:00,061XX N SHERIDAN RD,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,FALSE,FALSE,2433,24,48,77,11,,,2016,5/16/2016 15:50,,,
10525877,HZ268138,4/15/2016 15:00,023XX W EASTWOOD AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,FALSE,FALSE,1911,19,47,4,11,,,2016,5/18/2016 15:50,,,
@@ -1,11 +1,11 @@
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
10378283,HZ114126,1/10/2016 11:00,033XX W IRVING PARK RD,610,BURGLARY,FORCIBLE ENTRY,RESIDENCE-GARAGE,TRUE,FALSE,1724,17,33,16,5,1153593,1926401,2016,5/22/2016 15:51,41.95388599,-87.71077048,"(41.95388599, -87.710770479)"
10382154,HZ118288,1/10/2016 21:00,055XX S FRANCISCO AVE,1754,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,RESIDENCE,FALSE,TRUE,824,8,14,63,2,1157983,1867874,2016,6/1/2016 15:51,41.79319349,-87.69622926,"(41.793193489, -87.696229255)"
10374287,HZ110730,1/10/2016 11:50,043XX W ARMITAGE AVE,5002,OTHER OFFENSE,OTHER VEHICLE OFFENSE,STREET,FALSE,TRUE,2522,25,30,20,26,1146917,1912931,2016,6/7/2016 15:55,41.91705356,-87.73565764,"(41.917053561, -87.735657637)"
10374662,HZ110403,1/10/2016 1:30,073XX S CLAREMONT AVE,497,BATTERY,AGGRAVATED DOMESTIC BATTERY: OTHER DANG WEAPON,STREET,FALSE,TRUE,835,8,18,66,04B,1162007,1855951,2016,2/4/2016 15:44,41.76039236,-87.68180481,"(41.760392356, -87.681804812)"
10374720,HZ110836,1/10/2016 7:30,079XX S RHODES AVE,890,THEFT,FROM BUILDING,OTHER,FALSE,FALSE,624,6,6,44,6,1181279,1852568,2016,2/4/2016 15:44,41.75068679,-87.61127681,"(41.75068679, -87.611276811)"
10375178,HZ110832,1/10/2016 14:20,057XX S KEDZIE AVE,460,BATTERY,SIMPLE,RESTAURANT,FALSE,FALSE,824,8,14,63,08B,1156029,1866379,2016,2/4/2016 15:44,41.78913051,-87.7034346,"(41.78913051, -87.703434602)"
10398695,HZ135279,1/10/2016 23:00,031XX S PARNELL AVE,620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE-GARAGE,FALSE,FALSE,915,9,11,60,5,1173138,1884117,2016,2/4/2016 15:44,41.8374442,-87.64017699,"(41.837444199, -87.640176991)"
10402270,HZ138745,1/10/2016 11:00,051XX S ELIZABETH ST,620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,FALSE,FALSE,934,9,16,61,5,,,2016,2/4/2016 6:53,,,
10380619,HZ116583,1/10/2016 9:41,091XX S PAXTON AVE,4387,OTHER OFFENSE,VIOLATE ORDER OF PROTECTION,RESIDENCE,TRUE,TRUE,413,4,7,48,26,1192434,1844707,2016,2/2/2016 15:56,41.72885134,-87.57065553,"(41.728851343, -87.570655525)"
10400131,HZ136171,1/10/2016 18:00,0000X W TERMINAL ST,810,THEFT,OVER $500,AIRPORT BUILDING NON-TERMINAL - SECURE AREA,FALSE,FALSE,1651,16,41,76,6,,,2016,2/2/2016 15:58,,,
@@ -1,204 +1,204 @@
{
  "id": "75637565-60ad-4baa-87d3-396a7930cfe7",
  "blocks": [
    {
      "id": "ba5a8061-129e-4618-953a-ce3e89c8f2cb",
      "type": "Microsoft.DPrep.GetFilesBlock",
      "arguments": {
        "path": {
          "target": 0,
          "resourceDetails": [
            {
              "path": "./crime-spring.csv"
            }
          ]
        }
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    },
    {
      "id": "1b345643-6b60-4ca1-99f9-2a64ae932a23",
      "type": "Microsoft.DPrep.ParseDelimitedBlock",
      "arguments": {
        "columnHeadersMode": 1,
        "fileEncoding": 0,
        "handleQuotedLineBreaks": false,
        "preview": false,
        "separator": ",",
        "skipRowsMode": 0
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    },
    {
      "id": "12cf73a2-1487-4915-bfa7-c86be7de08c0",
      "type": "Microsoft.DPrep.SetColumnTypesBlock",
      "arguments": {
        "columnConversion": [
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "ID"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "IUCR"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Domestic"
              }
            },
            "typeProperty": 1
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Beat"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "District"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Ward"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Community Area"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Year"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Longitude"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Arrest"
              }
            },
            "typeProperty": 1
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "X Coordinate"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Updated On"
              }
            },
            "typeArguments": {
              "dateTimeFormats": [
                "%m/%d/%Y %I:%M:%S %p"
              ]
            },
            "typeProperty": 4
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Date"
              }
            },
            "typeArguments": {
              "dateTimeFormats": [
                "%m/%d/%Y %I:%M:%S %p"
              ]
            },
            "typeProperty": 4
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Y Coordinate"
              }
            },
            "typeProperty": 3
          },
          {
            "column": {
              "type": 2,
              "details": {
                "selectedColumn": "Latitude"
              }
            },
            "typeProperty": 3
          }
        ]
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    },
    {
      "id": "dfd62543-9285-412b-a930-0aeaaffde699",
      "type": "Microsoft.DPrep.HandlePathColumnBlock",
      "arguments": {
        "pathColumnOperation": 0
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    }
  ],
  "inspectors": []
}
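The `.dprep` package above is plain JSON, so its pipeline can be inspected without the Data Prep SDK itself. A minimal stdlib sketch that lists the transformation steps; the trimmed `package_json` string below is a stand-in mirroring the structure shown, not the full file:

```python
import json

# Trimmed stand-in for the package JSON above (same ids and block order,
# non-essential fields omitted for brevity).
package_json = """
{
  "id": "75637565-60ad-4baa-87d3-396a7930cfe7",
  "blocks": [
    {"id": "ba5a8061-129e-4618-953a-ce3e89c8f2cb", "type": "Microsoft.DPrep.GetFilesBlock"},
    {"id": "1b345643-6b60-4ca1-99f9-2a64ae932a23", "type": "Microsoft.DPrep.ParseDelimitedBlock"},
    {"id": "12cf73a2-1487-4915-bfa7-c86be7de08c0", "type": "Microsoft.DPrep.SetColumnTypesBlock"},
    {"id": "dfd62543-9285-412b-a930-0aeaaffde699", "type": "Microsoft.DPrep.HandlePathColumnBlock"}
  ]
}
"""

package = json.loads(package_json)
# Each block is one step in the dataflow; the type name identifies the operation.
steps = [block["type"].split(".")[-1] for block in package["blocks"]]
print(steps)
```

Reading the block list this way makes the saved pipeline self-documenting: get files, parse the delimiter, set column types, then handle the path column.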
@@ -1,10 +1,10 @@
10140490 HY329907 7/5/2015 23:50 050XX N NEWLAND AVE 820 THEFT
10139776 HY329265 7/5/2015 23:30 011XX W MORSE AVE 460 BATTERY
10140270 HY329253 7/5/2015 23:20 121XX S FRONT AVE 486 BATTERY
10139885 HY329308 7/5/2015 23:19 051XX W DIVISION ST 610 BURGLARY
10140379 HY329556 7/5/2015 23:00 012XX W LAKE ST 930 MOTOR VEHICLE THEFT
10140868 HY330421 7/5/2015 22:54 118XX S PEORIA ST 1320 CRIMINAL DAMAGE
10139762 HY329232 7/5/2015 22:42 026XX W 37TH PL 1020 ARSON
10139722 HY329228 7/5/2015 22:30 016XX S CENTRAL PARK AVE 1811 NARCOTICS
10139774 HY329209 7/5/2015 22:15 048XX N ASHLAND AVE 1310 CRIMINAL DAMAGE
10139697 HY329177 7/5/2015 22:10 058XX S ARTESIAN AVE 1320 CRIMINAL DAMAGE
@@ -1,12 +1,12 @@
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,FALSE,FALSE,531,5,9,50,11,1183356,1831503,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,FALSE,FALSE,614,6,21,71,6,1166776,1850053,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,FALSE,FALSE,2211,22,19,74,11,,,2016,5/12/2016 15:50,,,
10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,FALSE,FALSE,531,5,9,49,10,,,2016,5/13/2016 15:51,,,
10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",FALSE,FALSE,1712,17,40,13,6,,,2016,5/25/2016 15:59,,,
10535059,HZ278872,4/15/2016 4:30,004XX S KILBOURN AVE,810,THEFT,OVER $500,RESIDENCE,FALSE,FALSE,1131,11,24,26,6,,,2016,5/25/2016 15:59,,,
10499802,HZ240778,4/15/2016 10:00,010XX N MILWAUKEE AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,RESIDENCE,FALSE,FALSE,1213,12,27,24,11,,,2016,5/27/2016 15:45,,,
10522293,HZ264802,4/15/2016 16:00,019XX W DIVISION ST,1110,DECEPTIVE PRACTICE,BOGUS CHECK,RESTAURANT,FALSE,FALSE,1424,14,1,24,11,1163094,1908003,2016,5/16/2016 15:48,41.90320604,-87.67636193,"(41.903206037, -87.676361925)"
10523111,HZ265911,4/15/2016 8:00,061XX N SHERIDAN RD,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,FALSE,FALSE,2433,24,48,77,11,,,2016,5/16/2016 15:50,,,
10525877,HZ268138,4/15/2016 15:00,023XX W EASTWOOD AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,FALSE,FALSE,1911,19,47,4,11,,,2016,5/18/2016 15:50,,,
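The file above intentionally repeats its header row, a data-quality wrinkle the Data Prep tutorials use to demonstrate header handling. A stdlib-only sketch of one way such repeated header lines might be dropped before parsing; the embedded `raw` sample is made up and only mirrors the structure:

```python
import csv
import io

# Made-up sample mirroring the file above: the header line appears twice.
raw = """ID,Case Number,Primary Type
ID,Case Number,Primary Type
10498554,HZ239907,DECEPTIVE PRACTICE
10516598,HZ258664,THEFT
"""

lines = raw.splitlines()
header = lines[0]
# Keep the first header; drop any later line identical to it.
cleaned = [header] + [ln for ln in lines[1:] if ln != header]
rows = list(csv.reader(io.StringIO("\n".join(cleaned))))
print(len(rows))
```

This mirrors what the SDK's `columnHeadersMode` / skip-rows settings accomplish declaratively in the `.dprep` package.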
@@ -1,10 +1,10 @@
10140490 HY329907 7/5/2015 23:50 050XX N NEWLAND AVE 820 THEFT
10139776 HY329265 7/5/2015 23:30 011XX W MORSE AVE 460 BATTERY
10140270 HY329253 7/5/2015 23:20 121XX S FRONT AVE 486 BATTERY
10139885 HY329308 7/5/2015 23:19 051XX W DIVISION ST 610 BURGLARY
10140379 HY329556 7/5/2015 23:00 012XX W LAKE ST 930 MOTOR VEHICLE THEFT
10140868 HY330421 7/5/2015 22:54 118XX S PEORIA ST 1320 CRIMINAL DAMAGE
10139762 HY329232 7/5/2015 22:42 026XX W 37TH PL 1020 ARSON
10139722 HY329228 7/5/2015 22:30 016XX S CENTRAL PARK AVE 1811 NARCOTICS
10139774 HY329209 7/5/2015 22:15 048XX N ASHLAND AVE 1310 CRIMINAL DAMAGE
10139697 HY329177 7/5/2015 22:10 058XX S ARTESIAN AVE 1320 CRIMINAL DAMAGE
@@ -1,11 +1,11 @@
ID |CaseNumber| |Completed|
10140490 |HY329907| |Y|
10139776 |HY329265| |Y|
10140270 |HY329253| |N|
10139885 |HY329308| |Y|
10140379 |HY329556| |N|
10140868 |HY330421| |N|
10139762 |HY329232| |N|
10139722 |HY329228| |Y|
10139774 |HY329209| |N|
10139697 |HY329177| |N|
File diff suppressed because it is too large
@@ -1,4 +1,4 @@
def transform(df, index):
    # Replace missing coordinates with the string '0' so downstream
    # steps do not fail on empty Latitude/Longitude cells.
    df['Latitude'].fillna('0', inplace=True)
    df['Longitude'].fillna('0', inplace=True)
    return df
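The custom transform above can be exercised on a toy partition without the SDK; this sketch assumes pandas is available, uses made-up sample values (only the column names match the crime data), and rewrites the fills as plain column assignment, which is equivalent and also safe under pandas copy-on-write, where Series-level `inplace=True` no longer mutates the parent frame:

```python
import pandas as pd

# Assignment-based equivalent of the fillna transform above.
def transform(df, index):
    df['Latitude'] = df['Latitude'].fillna('0')
    df['Longitude'] = df['Longitude'].fillna('0')
    return df

# Made-up sample partition with one missing value per column.
sample = pd.DataFrame({
    'Latitude': ['41.79', None],
    'Longitude': [None, '-87.61'],
})
out = transform(sample, 0)
print(out['Latitude'].tolist())
```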
@@ -1,251 +1,251 @@
median_income
4.4896
2.1029
2.3889
3.707
6.4788
4.4074
5.2907
1.5156
8.4411
4.4085
2.1439
2.8971
6.1008
3.5258
2.7694
2.2356
1.9509
4.0905
3.6726
3.1696
2.5389
3.0319
4.6779
2.9076
2.8616
1.4722
5.6413
2.1167
4.7308
4.8173
2.3438
1.7333
1.4429
2.3253
2.4022
3.4048
6.6073
4.1080000000000005
4.2829
1.5727
2.5211
4.2679
4.7328
4.7069
2.465
5.0267
2.8043
2.4053
1.2176
2.39
3.6364
6.0162
2.8088
3.3984
4.5
3.9079
4.9618
2.9344
2.4283
3.7388
1.6021
2.3352
4.0982
1.9531
3.2386
5.1169
4.692
4.0
6.4238
3.7375
2.8233
2.8009
3.767
3.6761
5.0282
3.5296
5.215
4.0125
9.4667
5.9062
3.9864
2.0734
2.875
3.3611
2.8214
0.9946
4.5446
4.6908
9.3198
1.2826
2.4943
10.1882
4.6731
4.375
2.8173
2.0903
2.725
2.8547
2.25
1.9444
1.7167
1.9342
4.9524
3.65
3.0856
3.2396
2.9324
3.495
1.9818
4.6964
3.925
3.625
2.9688
4.0417
9.7956
3.8732
2.6998
2.006
4.25
3.1839999999999997
5.9658
2.628
2.5057
5.155
4.6
4.6681
5.5942
5.1104
3.0759
3.5757
3.6845
6.4667
5.273
3.0635
11.2866
4.0444
5.2541
5.5791
4.5375
9.8144
6.7257
4.1442
4.0313
2.2791
4.1679
3.2852
3.2768
5.021
4.875
4.419
3.3272
4.2386
1.245
5.152
4.8125
2.1638
7.1621
1.5372
10.0481
3.3869
5.4591
4.4318
6.5044
4.2865
3.0461
11.3283
2.7026
3.016
3.0943
3.225
6.187
3.8158
3.0147
15.0
3.1364
2.9
5.5941
3.4028
6.0062
8.3792
3.8036
2.0926
6.7703
4.2569
4.744
9.7037
5.1292
2.3148
3.3021
1.95
3.025
2.6523
1.2188
5.827999999999999
3.1587
2.45
2.3851
2.1221
3.5313
3.4821
7.8252
5.1878
3.7459
6.0097
2.3194
4.2061
2.267
2.2109
2.7589
2.6553
6.3325
5.7233
4.337
3.9667
5.8623
1.6806
3.5851
2.9716
3.9
2.7431
3.3621
1.9464
7.3518
4.775
3.5968
6.221
10.0968
1.9483
2.0469
3.725
3.675
1.8529
1.7159
1.7386
3.6687
3.4671
4.8233
4.3036
1.6488
2.9453
5.0096
3.175
4.2031
3.1667
5.7204
3.375
6.5483
4.2206
2.6631
3.5363
@@ -1,251 +1,251 @@
median_income,median_income_uniform,median_income_normal
4.4896,0.688927015969381,0.4928112398942898
2.1029,0.16242159563866576,-0.9845540061415601
2.3889,0.20433495515563627,-0.8262365643809355
3.707,0.5167832475474021,0.04208177981426486
6.4788,0.7918154943685715,0.8127367410844715
4.4074,0.6708459812590735,0.44225037403262457
5.2907,0.7627885954411082,0.71530142355699
1.5156,0.07635265842077493,-1.430040774223022
8.4411,0.8397571522806675,0.9934602851577232
4.4085,0.6710879415775812,0.44291928632820865
2.1439,0.16843015417081886,-0.960387338010023
2.8971,0.3028380993334766,-0.5162551753422808
6.1008,0.7825804402531089,0.780937730437047
3.5258,0.47180713824983866,-0.07072794893657437
2.7694,0.2685175231133089,-0.6173027607303029
2.2356,0.18186880825370766,-0.9082661605741446
1.9509,0.14014596400726886,-1.0796637810446446
4.0905,0.6011394131362455,0.25629744014575906
3.6726,0.5092164884958867,0.023104366055624576
3.1696,0.3760750376263169,-0.31580558952782856
2.5389,0.2263174863708306,-0.7510293827903433
3.0319,0.3390668673403568,-0.4150111583096138
4.6779,0.7303462232193919,0.61386044030116
2.9076,0.3056600731025585,-0.5081899386067169
2.8616,0.29329714039991395,-0.5437779548183906
1.4722,0.0699923793891787,-1.475847787309027
5.6413,0.7713542302900003,0.7433140959507171
2.1167,0.1644439885104636,-0.9763562034967139
4.7308,0.7419823149003563,0.6494688564486208
4.8173,0.7512227895726955,0.6783427156995535
2.3438,0.19772554077026783,-0.8497733992585795
1.7333,0.10825663872442697,-1.235852818864947
1.4429,0.06569845829181076,-1.5086161030849279
2.3253,0.19501436192039387,-0.8595652745311925
2.4022,0.20628407292338352,-0.8193826233304142
3.4048,0.4392872500537518,-0.15277653846062797
6.6073,0.7949549241406269,0.8237349930522744
4.1080000000000005,0.6049887818397783,0.2662814785234653
4.2829,0.6434604724825127,0.3677239840415221
1.5727,0.0847206753033589,-1.3740010688161062
2.5211,0.22370889266662755,-0.7597270013437459
4.2679,0.6401610135937705,0.35888920867930674
4.7328,0.7424222427521885,0.6508311116191982
4.7069,0.7367251770709602,0.6332817852325006
2.465,0.21548742599214485,-0.7875245686101101
5.0267,0.7563387163763406,0.6945735910123612
2.8043,0.277897226402924,-0.5890996132954226
2.4053,0.20673837856849753,-0.8177906162820535
1.2176,0.03268069640658888,-1.8427782558151542
2.39,0.2044961603845477,-0.8256682268514074
3.6364,0.501253794377722,0.0031428016114447123
6.0162,0.780513547189172,0.7739287859106961
2.8088,0.2791066437325306,-0.5854974399696314
3.3984,0.4375671898516448,-0.15714017005326586
4.5,0.6912146407989088,0.4992962189785684
3.9079,0.5609740002639567,0.1534391166508579
4.9618,0.7547531211062519,0.6895237101278
2.9344,0.3128628251988819,-0.4877518079508566
2.4283,0.21010903335482733,-0.8060429810797759
3.7388,0.5237781003915357,0.05963819257139614
1.6021,0.089029251421537,-1.346757015256811
2.3352,0.1964652089805967,-0.8543151087700283
4.0982,0.6028331353658,0.26068721102120757
1.9531,0.1404683744650917,-1.078217399304828
3.2386,0.39461943668028376,-0.26729910818310126
5.1169,0.7585424250568029,0.7016216442421952
4.692,0.7334477145748096,0.6232738623769376
4.0,0.5812326778408341,0.20504797284322906
6.4238,0.7904717695634116,0.8080592738489488
3.7375,0.5234921472878448,0.05892015392758171
2.8233,0.28300365512792947,-0.5739416160725967
2.8009,0.2769834444205546,-0.5918263314054893
3.767,0.5299810831023711,0.07522231004048434
3.6761,0.5099863622365932,0.025034712726592745
5.0282,0.7563753634164814,0.6946905154674345
3.5296,0.47282842399483976,-0.06816178418735802
5.215,0.7609391414820063,0.70932677431432
4.0125,0.583982226914786,0.21209163422132077
9.4667,0.8648139551928855,1.1022060793492992
5.9062,0.7778260975788522,0.7648719390554228
3.9864,0.5782411684483745,0.19739599549652825
2.0734,0.15809836449967754,-1.0023041289181829
2.875,0.29689851644807563,-0.5333417486766315
3.3611,0.4275424639862395,-0.1826343528680402
2.8214,0.2824930122554289,-0.5754514381784691
0.9946,9.999999977795539e-08,-5.199337582605575
4.5446,0.7010250318947692,0.5273508961812362
4.6908,0.7331837578637103,0.6224705781821348
9.3198,0.861224988395104,1.085839464021226
1.2826,0.04220645993317308,-1.725635992615424
2.4943,0.2197813470895128,-0.7729318836756727
10.1882,0.8824411815005742,1.1872788772739642
4.6731,0.7292903963749944,0.6106682834354926
4.375,0.6637191500593902,0.42263484449515154
2.8173,0.281391098688454,-0.578713956288219
2.0903,0.16057506301658947,-0.9920971718209968
2.725,0.2565846054611911,-0.6539109002088199
2.8547,0.29144270049451715,-0.5491749392049887
2.25,0.18397913125036633,-0.9003044332369591
1.9444,0.13919338765461042,-1.0839504359991745
1.7167,0.10582390526994544,-1.2490472005138797
1.9342,0.13769857553197723,-1.0907176288996137
4.9524,0.7545234663213701,0.6887937530976251
3.65,0.5042453037701816,0.010641599310222728
3.0856,0.35349924747366146,-0.37589024233948426
3.2396,0.394888196086863,-0.26660099147199706
2.9324,0.3123253063857234,-0.48926992156445664
3.495,0.46352934852719846,-0.09154607537767347
1.9818,0.14467436543759887,-1.0595514059995472
4.6964,0.7344155558488406,0.6262226873574688
3.925,0.5647353833971228,0.16298628506764115
3.625,0.4984680713824984,-0.003839985024398658
2.9688,0.3221081487852074,-0.4618117868349911
4.0417,0.5904051735515374,0.22858735593517016
9.7956,0.8728494295277418,1.1399643807904891
3.8732,0.5533412520346663,0.13410759250347515
2.6998,0.24989741485432906,-0.6748126069650576
2.006,0.1482208804736502,-1.0440943039822579
4.25,0.6362236593198715,0.34838284885436693
3.1839999999999997,0.3799451730810577,-0.3056247863400478
5.9658,0.7792822066404437,0.7697712668826051
2.628,0.23937510991265604,-0.7083141049607123
2.5057,0.2214520194618676,-0.7672985355850374
5.155,0.7594732598763774,0.7046091860898294
4.6,0.7132110333905237,0.5627899373325242
4.6681,0.7281905767454135,0.6073497232884044
5.5942,0.7702035132295815,0.7395172425174374
5.1104,0.7583836212161931,0.7011125840899696
3.0759,0.35089228122984295,-0.38291260834816176
3.5757,0.4852182326381423,-0.037060878177008386
3.6845,0.5118340592142888,0.02966793907608783
6.4667,0.7915198749114363,0.8117061751243242
5.273,0.7623561603674476,0.7139021637992619
3.0635,0.3475596645882605,-0.3919172875113563
11.2866,0.9092765874276221,1.3363135019854007
4.0444,0.5909990761515111,0.23011572277987793
5.2541,0.7618944076616745,0.7124095804237914
5.5791,0.7698345996921648,0.7383022482134728
4.5375,0.6994632880207644,0.5228574972480653
9.8144,0.8733087390975055,1.1421720249180776
6.7257,0.7978475971757347,0.8339577268169219
4.1442,0.6129514759579427,0.28701994562086086
4.0313,0.5881175487220095,0.22270526607462063
2.2791,0.18824374230611404,-0.88438672867467
4.1679,0.6181646210021556,0.3006639542199363
3.2852,0.40714362502687595,-0.23489883984002982
3.2768,0.4048860460116104,-0.24072005796234255
5.021,0.7561994576238059,0.6941293646505097
4.875,0.7526324790501087,0.682797132481279
4.419,0.6733975627997006,0.44931439580522814
3.3272,0.41843152010320356,-0.20590766210173994
4.2386,0.6337160705644274,0.341711709817628
1.245,0.03669617210856439,-1.7903831065617093
5.152,0.7593999657960959,0.704373718687922
4.8125,0.7511055190442452,0.6779727645490474
2.1638,0.1713465033120347,-0.9488576690059357
7.1621,0.8085094427206763,0.872416661000832
1.5372,0.0795181429157629,-1.408320169672634
10.0481,0.8790183479514304,1.1700935990858978
3.3869,0.43447645667598356,-0.16498865287495923
5.4591,0.7669028364809068,0.7286850708477646
4.4318,0.6762131010514274,0.4571353017049396
6.5044,0.7924409371869732,0.8149199625504349
4.2865,0.644252342615811,0.3698485817763834
3.0461,0.3428832509137819,-0.40460687823111086
11.3283,0.9102953751435343,1.3425761736583905
2.7026,0.25056439475381626,-0.6727147380915197
3.016,0.33479359277574705,-0.4267146406076349
3.0943,0.35583745431090086,-0.3696075717076287
3.225,0.39096430875080623,-0.2768065942137824
6.187,0.7846864234931958,0.7881189136255171
3.8158,0.5407153226870792,0.10223599868489697
3.0147,0.33444420554719406,-0.42767409705501475
15.0,0.9999999000000003,5.19933758270342
3.1364,0.36715222532788644,-0.33940526790689773
2.9,0.30361750161255635,-0.5140242974144523
5.5941,0.7702010700935721,0.739509192607556
3.4028,0.4387497312405934,-0.15413985695864643
6.0062,0.7802692335882339,0.7731028191405914
8.3792,0.8382448510908602,0.9872699911993058
3.8036,0.5380317627909023,0.09547634997560216
2.0926,0.1609121284952224,-0.9907160390730209
6.7703,0.798937235835919,0.837831173103705
4.2569,0.6377414104086929,0.3524281695696263
4.744,0.7448858387224493,0.6584822133838069
9.7037,0.87060418753512,1.1292518063856445
5.1292,0.7588429307859569,0.7025854406375655
2.3148,0.19347558473533027,-0.8651596484776254
3.3021,0.41168565899806486,-0.22321096549396113
|
||||
1.95,0.1400140688199777,-1.0802561341135346
|
||||
3.025,0.33721242743496016,-0.42008295383958816
|
||||
2.6523,0.2429362799695175,-0.6968885244090585
|
||||
1.2188,0.032856556656310446,-1.8403755990446928
|
||||
5.827999999999999,0.7759155652195158,0.7584713308686281
|
||||
3.1587,0.3731455600946033,-0.3235336598161438
|
||||
2.45,0.21328917287062546,-0.7950604390156254
|
||||
2.3851,0.20377806436485135,-0.8282019693528496
|
||||
2.1221,0.16523535963421065,-0.9731661745087966
|
||||
3.5313,0.47328531498602444,-0.06701390942662634
|
||||
3.4821,0.4600623521823264,-0.10027663775339715
|
||||
7.8252,0.8247098775988859,0.9334643965884332
|
||||
5.1878,0.7602746084874545,0.7071861842977014
|
||||
3.7459,0.5253398442655404,0.06356034060479637
|
||||
6.0097,0.7803547433485621,0.773391847490294
|
||||
2.3194,0.19414971569259623,-0.862705456837268
|
||||
4.2061,0.6265672429721525,0.32277517627250846
|
||||
2.267,0.18647048478808834,-0.8909780060851004
|
||||
2.2109,0.1782490181136057,-0.9220585498888423
|
||||
2.7589,0.265695549344227,-0.6258838950234518
|
||||
2.6553,0.2433759305938214,-0.6954842771160817
|
||||
6.3325,0.7882411863868461,0.8003334507140982
|
||||
5.7233,0.7733576018176932,0.7499500376839813
|
||||
4.337,0.6553605208745764,0.3998337668630456
|
||||
3.9667,0.5739078791078263,0.18633222024962756
|
||||
5.8623,0.7767535608707337,0.7612749184789059
|
||||
1.6806,0.1005334427574887,-1.2785178769957448
|
||||
3.5851,0.48774457105998714,-0.03072463804138443
|
||||
2.9716,0.32286067512362926,-0.45971423541894674
|
||||
3.9,0.5592362852492191,0.14903320783722215
|
||||
2.7431,0.26144915072027514,-0.6388841474573944
|
||||
3.3621,0.4278112233928187,-0.18194938611936773
|
||||
1.9464,0.13948648807081301,-1.0826293461973695
|
||||
7.3518,0.8131440717304732,0.8895420092252918
|
||||
4.775,0.750189343040727,0.6750857069958222
|
||||
3.5968,0.4908890561169641,-0.022839735128387998
|
||||
6.221,0.7855170897363856,0.7909625929859541
|
||||
10.0968,0.8802081551879993,1.1760279942069634
|
||||
1.9483,0.1397649334662055,-1.0813760586453744
|
||||
2.0469,0.15421478398499322,-1.0185228067876306
|
||||
3.725,0.5207425982138929,0.05201743189676979
|
||||
3.675,0.5097444019180853,0.02442802263182327
|
||||
1.8529,0.1257840436133419,-1.1465489456932518
|
||||
1.7159,0.10570666510346441,-1.2496885811666654
|
||||
1.7386,0.10903335482736382,-1.2316851754273808
|
||||
3.6687,0.5083586291848137,0.020953509422343072
|
||||
3.4671,0.45603096108363783,-0.11043812057602183
|
||||
4.8233,0.7513693777332584,0.6788052852389578
|
||||
4.3036,0.6480137257489771,0.3799634475571365
|
||||
1.6488,0.0958731461398675,-1.3054305179573955
|
||||
2.9453,0.3157923027305955,-0.47949769802476955
|
||||
5.0096,0.7559209401187363,0.6932413225605799
|
||||
3.175,0.3775263384218447,-0.31198400815546146
|
||||
4.2031,0.625907351194404,0.32103311272888746
|
||||
3.1667,0.3752956353472371,-0.3178598218476214
|
||||
5.7204,0.7732867508734211,0.7497147894780406
|
||||
3.375,0.43127821973769076,-0.17312084617136517
|
||||
6.5483,0.7935134738950917,0.818672917584347
|
||||
4.2206,0.6297567198979367,0.33120908249828585
|
||||
2.6631,0.2445190222170115,-0.6918396326662699
|
||||
3.5363,0.4746291120189207,-0.06363831319524592
|
||||
median_income,median_income_uniform,median_income_normal
|
||||
4.4896,0.688927015969381,0.4928112398942898
|
||||
2.1029,0.16242159563866576,-0.9845540061415601
|
||||
2.3889,0.20433495515563627,-0.8262365643809355
|
||||
3.707,0.5167832475474021,0.04208177981426486
|
||||
6.4788,0.7918154943685715,0.8127367410844715
|
||||
4.4074,0.6708459812590735,0.44225037403262457
|
||||
5.2907,0.7627885954411082,0.71530142355699
|
||||
1.5156,0.07635265842077493,-1.430040774223022
|
||||
8.4411,0.8397571522806675,0.9934602851577232
|
||||
4.4085,0.6710879415775812,0.44291928632820865
|
||||
2.1439,0.16843015417081886,-0.960387338010023
|
||||
2.8971,0.3028380993334766,-0.5162551753422808
|
||||
6.1008,0.7825804402531089,0.780937730437047
|
||||
3.5258,0.47180713824983866,-0.07072794893657437
|
||||
2.7694,0.2685175231133089,-0.6173027607303029
|
||||
2.2356,0.18186880825370766,-0.9082661605741446
|
||||
1.9509,0.14014596400726886,-1.0796637810446446
|
||||
4.0905,0.6011394131362455,0.25629744014575906
|
||||
3.6726,0.5092164884958867,0.023104366055624576
|
||||
3.1696,0.3760750376263169,-0.31580558952782856
|
||||
2.5389,0.2263174863708306,-0.7510293827903433
|
||||
3.0319,0.3390668673403568,-0.4150111583096138
|
||||
4.6779,0.7303462232193919,0.61386044030116
|
||||
2.9076,0.3056600731025585,-0.5081899386067169
|
||||
2.8616,0.29329714039991395,-0.5437779548183906
|
||||
1.4722,0.0699923793891787,-1.475847787309027
|
||||
5.6413,0.7713542302900003,0.7433140959507171
|
||||
2.1167,0.1644439885104636,-0.9763562034967139
|
||||
4.7308,0.7419823149003563,0.6494688564486208
|
||||
4.8173,0.7512227895726955,0.6783427156995535
|
||||
2.3438,0.19772554077026783,-0.8497733992585795
|
||||
1.7333,0.10825663872442697,-1.235852818864947
|
||||
1.4429,0.06569845829181076,-1.5086161030849279
|
||||
2.3253,0.19501436192039387,-0.8595652745311925
|
||||
2.4022,0.20628407292338352,-0.8193826233304142
|
||||
3.4048,0.4392872500537518,-0.15277653846062797
|
||||
6.6073,0.7949549241406269,0.8237349930522744
|
||||
4.1080000000000005,0.6049887818397783,0.2662814785234653
|
||||
4.2829,0.6434604724825127,0.3677239840415221
|
||||
1.5727,0.0847206753033589,-1.3740010688161062
|
||||
2.5211,0.22370889266662755,-0.7597270013437459
|
||||
4.2679,0.6401610135937705,0.35888920867930674
|
||||
4.7328,0.7424222427521885,0.6508311116191982
|
||||
4.7069,0.7367251770709602,0.6332817852325006
|
||||
2.465,0.21548742599214485,-0.7875245686101101
|
||||
5.0267,0.7563387163763406,0.6945735910123612
|
||||
2.8043,0.277897226402924,-0.5890996132954226
|
||||
2.4053,0.20673837856849753,-0.8177906162820535
|
||||
1.2176,0.03268069640658888,-1.8427782558151542
|
||||
2.39,0.2044961603845477,-0.8256682268514074
|
||||
3.6364,0.501253794377722,0.0031428016114447123
|
||||
6.0162,0.780513547189172,0.7739287859106961
|
||||
2.8088,0.2791066437325306,-0.5854974399696314
|
||||
3.3984,0.4375671898516448,-0.15714017005326586
|
||||
4.5,0.6912146407989088,0.4992962189785684
|
||||
3.9079,0.5609740002639567,0.1534391166508579
|
||||
4.9618,0.7547531211062519,0.6895237101278
|
||||
2.9344,0.3128628251988819,-0.4877518079508566
|
||||
2.4283,0.21010903335482733,-0.8060429810797759
|
||||
3.7388,0.5237781003915357,0.05963819257139614
|
||||
1.6021,0.089029251421537,-1.346757015256811
|
||||
2.3352,0.1964652089805967,-0.8543151087700283
|
||||
4.0982,0.6028331353658,0.26068721102120757
|
||||
1.9531,0.1404683744650917,-1.078217399304828
|
||||
3.2386,0.39461943668028376,-0.26729910818310126
|
||||
5.1169,0.7585424250568029,0.7016216442421952
|
||||
4.692,0.7334477145748096,0.6232738623769376
|
||||
4.0,0.5812326778408341,0.20504797284322906
|
||||
6.4238,0.7904717695634116,0.8080592738489488
|
||||
3.7375,0.5234921472878448,0.05892015392758171
|
||||
2.8233,0.28300365512792947,-0.5739416160725967
|
||||
2.8009,0.2769834444205546,-0.5918263314054893
|
||||
3.767,0.5299810831023711,0.07522231004048434
|
||||
3.6761,0.5099863622365932,0.025034712726592745
|
||||
5.0282,0.7563753634164814,0.6946905154674345
|
||||
3.5296,0.47282842399483976,-0.06816178418735802
|
||||
5.215,0.7609391414820063,0.70932677431432
|
||||
4.0125,0.583982226914786,0.21209163422132077
|
||||
9.4667,0.8648139551928855,1.1022060793492992
|
||||
5.9062,0.7778260975788522,0.7648719390554228
|
||||
3.9864,0.5782411684483745,0.19739599549652825
|
||||
2.0734,0.15809836449967754,-1.0023041289181829
|
||||
2.875,0.29689851644807563,-0.5333417486766315
|
||||
3.3611,0.4275424639862395,-0.1826343528680402
|
||||
2.8214,0.2824930122554289,-0.5754514381784691
|
||||
0.9946,9.999999977795539e-08,-5.199337582605575
|
||||
4.5446,0.7010250318947692,0.5273508961812362
|
||||
4.6908,0.7331837578637103,0.6224705781821348
|
||||
9.3198,0.861224988395104,1.085839464021226
|
||||
1.2826,0.04220645993317308,-1.725635992615424
|
||||
2.4943,0.2197813470895128,-0.7729318836756727
|
||||
10.1882,0.8824411815005742,1.1872788772739642
|
||||
4.6731,0.7292903963749944,0.6106682834354926
|
||||
4.375,0.6637191500593902,0.42263484449515154
|
||||
2.8173,0.281391098688454,-0.578713956288219
|
||||
2.0903,0.16057506301658947,-0.9920971718209968
|
||||
2.725,0.2565846054611911,-0.6539109002088199
|
||||
2.8547,0.29144270049451715,-0.5491749392049887
|
||||
2.25,0.18397913125036633,-0.9003044332369591
|
||||
1.9444,0.13919338765461042,-1.0839504359991745
|
||||
1.7167,0.10582390526994544,-1.2490472005138797
|
||||
1.9342,0.13769857553197723,-1.0907176288996137
|
||||
4.9524,0.7545234663213701,0.6887937530976251
|
||||
3.65,0.5042453037701816,0.010641599310222728
|
||||
3.0856,0.35349924747366146,-0.37589024233948426
|
||||
3.2396,0.394888196086863,-0.26660099147199706
|
||||
2.9324,0.3123253063857234,-0.48926992156445664
|
||||
3.495,0.46352934852719846,-0.09154607537767347
|
||||
1.9818,0.14467436543759887,-1.0595514059995472
|
||||
4.6964,0.7344155558488406,0.6262226873574688
|
||||
3.925,0.5647353833971228,0.16298628506764115
|
||||
3.625,0.4984680713824984,-0.003839985024398658
|
||||
2.9688,0.3221081487852074,-0.4618117868349911
|
||||
4.0417,0.5904051735515374,0.22858735593517016
|
||||
9.7956,0.8728494295277418,1.1399643807904891
|
||||
3.8732,0.5533412520346663,0.13410759250347515
|
||||
2.6998,0.24989741485432906,-0.6748126069650576
|
||||
2.006,0.1482208804736502,-1.0440943039822579
|
||||
4.25,0.6362236593198715,0.34838284885436693
|
||||
3.1839999999999997,0.3799451730810577,-0.3056247863400478
|
||||
5.9658,0.7792822066404437,0.7697712668826051
|
||||
2.628,0.23937510991265604,-0.7083141049607123
|
||||
2.5057,0.2214520194618676,-0.7672985355850374
|
||||
5.155,0.7594732598763774,0.7046091860898294
|
||||
4.6,0.7132110333905237,0.5627899373325242
|
||||
4.6681,0.7281905767454135,0.6073497232884044
|
||||
5.5942,0.7702035132295815,0.7395172425174374
|
||||
5.1104,0.7583836212161931,0.7011125840899696
|
||||
3.0759,0.35089228122984295,-0.38291260834816176
|
||||
3.5757,0.4852182326381423,-0.037060878177008386
|
||||
3.6845,0.5118340592142888,0.02966793907608783
|
||||
6.4667,0.7915198749114363,0.8117061751243242
|
||||
5.273,0.7623561603674476,0.7139021637992619
|
||||
3.0635,0.3475596645882605,-0.3919172875113563
|
||||
11.2866,0.9092765874276221,1.3363135019854007
|
||||
4.0444,0.5909990761515111,0.23011572277987793
|
||||
5.2541,0.7618944076616745,0.7124095804237914
|
||||
5.5791,0.7698345996921648,0.7383022482134728
|
||||
4.5375,0.6994632880207644,0.5228574972480653
|
||||
9.8144,0.8733087390975055,1.1421720249180776
|
||||
6.7257,0.7978475971757347,0.8339577268169219
|
||||
4.1442,0.6129514759579427,0.28701994562086086
|
||||
4.0313,0.5881175487220095,0.22270526607462063
|
||||
2.2791,0.18824374230611404,-0.88438672867467
|
||||
4.1679,0.6181646210021556,0.3006639542199363
|
||||
3.2852,0.40714362502687595,-0.23489883984002982
|
||||
3.2768,0.4048860460116104,-0.24072005796234255
|
||||
5.021,0.7561994576238059,0.6941293646505097
|
||||
4.875,0.7526324790501087,0.682797132481279
|
||||
4.419,0.6733975627997006,0.44931439580522814
|
||||
3.3272,0.41843152010320356,-0.20590766210173994
|
||||
4.2386,0.6337160705644274,0.341711709817628
|
||||
1.245,0.03669617210856439,-1.7903831065617093
|
||||
5.152,0.7593999657960959,0.704373718687922
|
||||
4.8125,0.7511055190442452,0.6779727645490474
|
||||
2.1638,0.1713465033120347,-0.9488576690059357
|
||||
7.1621,0.8085094427206763,0.872416661000832
|
||||
1.5372,0.0795181429157629,-1.408320169672634
|
||||
10.0481,0.8790183479514304,1.1700935990858978
|
||||
3.3869,0.43447645667598356,-0.16498865287495923
|
||||
5.4591,0.7669028364809068,0.7286850708477646
|
||||
4.4318,0.6762131010514274,0.4571353017049396
|
||||
6.5044,0.7924409371869732,0.8149199625504349
|
||||
4.2865,0.644252342615811,0.3698485817763834
|
||||
3.0461,0.3428832509137819,-0.40460687823111086
|
||||
11.3283,0.9102953751435343,1.3425761736583905
|
||||
2.7026,0.25056439475381626,-0.6727147380915197
|
||||
3.016,0.33479359277574705,-0.4267146406076349
|
||||
3.0943,0.35583745431090086,-0.3696075717076287
|
||||
3.225,0.39096430875080623,-0.2768065942137824
|
||||
6.187,0.7846864234931958,0.7881189136255171
|
||||
3.8158,0.5407153226870792,0.10223599868489697
|
||||
3.0147,0.33444420554719406,-0.42767409705501475
|
||||
15.0,0.9999999000000003,5.19933758270342
|
||||
3.1364,0.36715222532788644,-0.33940526790689773
|
||||
2.9,0.30361750161255635,-0.5140242974144523
|
||||
5.5941,0.7702010700935721,0.739509192607556
|
||||
3.4028,0.4387497312405934,-0.15413985695864643
|
||||
6.0062,0.7802692335882339,0.7731028191405914
|
||||
8.3792,0.8382448510908602,0.9872699911993058
|
||||
3.8036,0.5380317627909023,0.09547634997560216
|
||||
2.0926,0.1609121284952224,-0.9907160390730209
|
||||
6.7703,0.798937235835919,0.837831173103705
|
||||
4.2569,0.6377414104086929,0.3524281695696263
|
||||
4.744,0.7448858387224493,0.6584822133838069
|
||||
9.7037,0.87060418753512,1.1292518063856445
|
||||
5.1292,0.7588429307859569,0.7025854406375655
|
||||
2.3148,0.19347558473533027,-0.8651596484776254
|
||||
3.3021,0.41168565899806486,-0.22321096549396113
|
||||
1.95,0.1400140688199777,-1.0802561341135346
|
||||
3.025,0.33721242743496016,-0.42008295383958816
|
||||
2.6523,0.2429362799695175,-0.6968885244090585
|
||||
1.2188,0.032856556656310446,-1.8403755990446928
|
||||
5.827999999999999,0.7759155652195158,0.7584713308686281
|
||||
3.1587,0.3731455600946033,-0.3235336598161438
|
||||
2.45,0.21328917287062546,-0.7950604390156254
|
||||
2.3851,0.20377806436485135,-0.8282019693528496
|
||||
2.1221,0.16523535963421065,-0.9731661745087966
|
||||
3.5313,0.47328531498602444,-0.06701390942662634
|
||||
3.4821,0.4600623521823264,-0.10027663775339715
|
||||
7.8252,0.8247098775988859,0.9334643965884332
|
||||
5.1878,0.7602746084874545,0.7071861842977014
|
||||
3.7459,0.5253398442655404,0.06356034060479637
|
||||
6.0097,0.7803547433485621,0.773391847490294
|
||||
2.3194,0.19414971569259623,-0.862705456837268
|
||||
4.2061,0.6265672429721525,0.32277517627250846
|
||||
2.267,0.18647048478808834,-0.8909780060851004
|
||||
2.2109,0.1782490181136057,-0.9220585498888423
|
||||
2.7589,0.265695549344227,-0.6258838950234518
|
||||
2.6553,0.2433759305938214,-0.6954842771160817
|
||||
6.3325,0.7882411863868461,0.8003334507140982
|
||||
5.7233,0.7733576018176932,0.7499500376839813
|
||||
4.337,0.6553605208745764,0.3998337668630456
|
||||
3.9667,0.5739078791078263,0.18633222024962756
|
||||
5.8623,0.7767535608707337,0.7612749184789059
|
||||
1.6806,0.1005334427574887,-1.2785178769957448
|
||||
3.5851,0.48774457105998714,-0.03072463804138443
|
||||
2.9716,0.32286067512362926,-0.45971423541894674
|
||||
3.9,0.5592362852492191,0.14903320783722215
|
||||
2.7431,0.26144915072027514,-0.6388841474573944
|
||||
3.3621,0.4278112233928187,-0.18194938611936773
|
||||
1.9464,0.13948648807081301,-1.0826293461973695
|
||||
7.3518,0.8131440717304732,0.8895420092252918
|
||||
4.775,0.750189343040727,0.6750857069958222
|
||||
3.5968,0.4908890561169641,-0.022839735128387998
|
||||
6.221,0.7855170897363856,0.7909625929859541
|
||||
10.0968,0.8802081551879993,1.1760279942069634
|
||||
1.9483,0.1397649334662055,-1.0813760586453744
|
||||
2.0469,0.15421478398499322,-1.0185228067876306
|
||||
3.725,0.5207425982138929,0.05201743189676979
|
||||
3.675,0.5097444019180853,0.02442802263182327
|
||||
1.8529,0.1257840436133419,-1.1465489456932518
|
||||
1.7159,0.10570666510346441,-1.2496885811666654
|
||||
1.7386,0.10903335482736382,-1.2316851754273808
|
||||
3.6687,0.5083586291848137,0.020953509422343072
|
||||
3.4671,0.45603096108363783,-0.11043812057602183
|
||||
4.8233,0.7513693777332584,0.6788052852389578
|
||||
4.3036,0.6480137257489771,0.3799634475571365
|
||||
1.6488,0.0958731461398675,-1.3054305179573955
|
||||
2.9453,0.3157923027305955,-0.47949769802476955
|
||||
5.0096,0.7559209401187363,0.6932413225605799
|
||||
3.175,0.3775263384218447,-0.31198400815546146
|
||||
4.2031,0.625907351194404,0.32103311272888746
|
||||
3.1667,0.3752956353472371,-0.3178598218476214
|
||||
5.7204,0.7732867508734211,0.7497147894780406
|
||||
3.375,0.43127821973769076,-0.17312084617136517
|
||||
6.5483,0.7935134738950917,0.818672917584347
|
||||
4.2206,0.6297567198979367,0.33120908249828585
|
||||
2.6631,0.2445190222170115,-0.6918396326662699
|
||||
3.5363,0.4746291120189207,-0.06363831319524592
|
||||
|
||||
|
@@ -1,63 +1,63 @@
{
  "id": "b308e5b8-9b2a-47f8-9d32-0f542b4a34a4",
  "name": "read_csv_duplicate_headers",
  "blocks": [
    {
      "id": "8d9ec228-6a4b-4abf-afb7-65f58dda1581",
      "type": "Microsoft.DPrep.GetFilesBlock",
      "arguments": {
        "path": {
          "target": 1,
          "resourceDetails": [
            {
              "path": "https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv",
              "sas": {
                "id": "https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv",
                "secretType": "AzureMLSecret"
              },
              "storageAccountName": null,
              "storageAccountKey": null
            }
          ]
        }
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    },
    {
      "id": "4ad0460f-ec65-47c0-a0a4-44345404a462",
      "type": "Microsoft.DPrep.ParseDelimitedBlock",
      "arguments": {
        "columnHeadersMode": 3,
        "fileEncoding": 0,
        "handleQuotedLineBreaks": false,
        "preview": false,
        "separator": ",",
        "skipRows": 0,
        "skipRowsMode": 0
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    },
    {
      "id": "1a3e11ba-5854-48da-aa47-53af61beb782",
      "type": "Microsoft.DPrep.DropColumnsBlock",
      "arguments": {
        "columns": {
          "type": 0,
          "details": {
            "selectedColumns": [
              "Path"
            ]
          }
        }
      },
      "isEnabled": true,
      "name": null,
      "annotation": null
    }
  ],
  "inspectors": []
}
@@ -1,11 +1,11 @@
Stream Path
https://dataset.blob.core.windows.net/blobstore/container/2019/01/01/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/02/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/03/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/04/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/05/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/06/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/07/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/08/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/09/train.csv
https://dataset.blob.core.windows.net/blobstore/container/2019/01/10/train.csv
@@ -1,360 +1,360 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Add Column using Expression\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With Azure ML Data Prep you can add a new column to data with `Dataflow.add_column` by using a Data Prep expression to calculate the value from existing columns. This is similar to using Python to create a [new script column](./custom-python-transforms.ipynb#New-Script-Column) except the Data Prep expressions are more limited and will execute faster. The expressions used are the same as for [filtering rows](./filtering.ipynb#Filtering-rows) and hence have the same functions and operators available.\n",
|
||||
"<p>\n",
|
||||
"Here we add additional columns. First we get input data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# loading data\n",
|
||||
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `substring(start, length)`\n",
|
||||
"Add a new column \"Case Category\" using the `substring(start, length)` expression to extract the prefix from the \"Case Number\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"case_category = dflow.add_column(new_column_name='Case Category',\n",
|
||||
" prior_column='Case Number',\n",
|
||||
" expression=dflow['Case Number'].substring(0, 2))\n",
|
||||
"case_category.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `substring(start)`\n",
|
||||
"Add a new column \"Case Id\" using the `substring(start)` expression to extract just the number from \"Case Number\" column and then convert it to numeric."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"case_id = dflow.add_column(new_column_name='Case Id',\n",
|
||||
" prior_column='Case Number',\n",
|
||||
" expression=dflow['Case Number'].substring(2))\n",
|
||||
"case_id = case_id.to_number('Case Id')\n",
|
||||
"case_id.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `length()`\n",
|
||||
"Using the length() expression, add a new numeric column \"Length\", which contains the length of the string in \"Primary Type\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_length = dflow.add_column(new_column_name='Length',\n",
|
||||
" prior_column='Primary Type',\n",
|
||||
" expression=dflow['Primary Type'].length())\n",
|
||||
"dflow_length.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `to_upper()`\n",
|
||||
"Using the to_upper() expression, add a new numeric column \"Upper Case\", which contains the length of the string in \"Primary Type\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_to_upper = dflow.add_column(new_column_name='Upper Case',\n",
|
||||
" prior_column='Primary Type',\n",
|
||||
" expression=dflow['Primary Type'].to_upper())\n",
|
||||
"dflow_to_upper.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `to_lower()`\n",
|
||||
"Using the to_lower() expression, add a new numeric column \"Lower Case\", which contains the length of the string in \"Primary Type\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_to_lower = dflow.add_column(new_column_name='Lower Case',\n",
|
||||
" prior_column='Primary Type',\n",
|
||||
" expression=dflow['Primary Type'].to_lower())\n",
|
||||
"dflow_to_lower.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `RegEx.extract_record()`\n",
|
||||
"Using the `RegEx.extract_record()` expression, add a new record column \"Stream Date Record\", which contains the name capturing groups in the regex with value."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_regex_extract_record = dprep.auto_read_file('../data/stream-path.csv')\n",
|
||||
"regex = dprep.RegEx('\\/(?<year>\\d{4})\\/(?<month>\\d{2})\\/(?<day>\\d{2})\\/')\n",
|
||||
"dflow_regex_extract_record = dflow_regex_extract_record.add_column(new_column_name='Stream Date Record',\n",
|
||||
" prior_column='Stream Path',\n",
|
||||
" expression=regex.extract_record(dflow_regex_extract_record['Stream Path']))\n",
|
||||
"dflow_regex_extract_record.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `create_datetime()`\n",
|
||||
"Using the `create_datetime()` expression, add a new column \"Stream Date\", which contains datetime values constructed from year, month, day values extracted from a record column \"Stream Date Record\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"year = dprep.col('year', dflow_regex_extract_record['Stream Date Record'])\n",
|
||||
"month = dprep.col('month', dflow_regex_extract_record['Stream Date Record'])\n",
|
||||
"day = dprep.col('day', dflow_regex_extract_record['Stream Date Record'])\n",
|
||||
"dflow_create_datetime = dflow_regex_extract_record.add_column(new_column_name='Stream Date',\n",
|
||||
" prior_column='Stream Date Record',\n",
|
||||
" expression=dprep.create_datetime(year, month, day))\n",
|
||||
"dflow_create_datetime.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) + col(column2)`\n",
|
||||
"Add a new column \"Total\" to show the result of adding the values in the \"FBI Code\" column to the \"Community Area\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_total = dflow.add_column(new_column_name='Total',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']+dflow['FBI Code'])\n",
|
||||
"dflow_total.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) - col(column2)`\n",
|
||||
"Add a new column \"Subtract\" to show the result of subtracting the values in the \"FBI Code\" column from the \"Community Area\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_diff = dflow.add_column(new_column_name='Difference',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']-dflow['FBI Code'])\n",
|
||||
"dflow_diff.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) * col(column2)`\n",
|
||||
"Add a new column \"Product\" to show the result of multiplying the values in the \"FBI Code\" column to the \"Community Area\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_prod = dflow.add_column(new_column_name='Product',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']*dflow['FBI Code'])\n",
|
||||
"dflow_prod.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) / col(column2)`\n",
|
||||
"Add a new column \"True Quotient\" to show the result of true (decimal) division of the values in the \"Community Area\" column by the values in the \"FBI Code\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_true_div = dflow.add_column(new_column_name='True Quotient',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']/dflow['FBI Code'])\n",
|
||||
"dflow_true_div.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) // col(column2)`\n",
|
||||
"Add a new column \"Floor Quotient\" to show the result of floor (integer) division of the values in the \"Community Area\" column by the values in the \"FBI Code\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_floor_div = dflow.add_column(new_column_name='Floor Quotient',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']//dflow['FBI Code'])\n",
|
||||
"dflow_floor_div.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) % col(column2)`\n",
|
||||
"Add a new column \"Mod\" to show the result of the modulo operation: the values in the \"Community Area\" column modulo the values in the \"FBI Code\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_mod = dflow.add_column(new_column_name='Mod',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']%dflow['FBI Code'])\n",
|
||||
"dflow_mod.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `col(column1) ** col(column2)`\n",
|
||||
"Add a new column \"Power\" to show the result of exponentiation, where the base is the \"Community Area\" column and the exponent is the \"FBI Code\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_pow = dflow.add_column(new_column_name='Power',\n",
|
||||
" prior_column='FBI Code',\n",
|
||||
" expression=dflow['Community Area']**dflow['FBI Code'])\n",
|
||||
"dflow_pow.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
}
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Append Columns and Rows\n",
|
||||
"\n",
|
||||
"Often the data we want does not come in a single dataset: it may come from different locations, have its features split across files, or simply not be homogeneous. Unsurprisingly, we typically want to work with a single dataset at a time.\n",
|
||||
"\n",
|
||||
"Azure ML Data Prep allows the concatenation of two or more dataflows by means of column and row appends.\n",
|
||||
"\n",
|
||||
"We will demonstrate this by defining a single dataflow that will pull data from multiple datasets.\n",
|
||||
"\n",
|
||||
"## Table of Contents\n",
|
||||
"[append_columns(dataflows)](#append_columns)<br>\n",
|
||||
"[append_rows(dataflows)](#append_rows)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"append_columns\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## `append_columns(dataflows)`\n",
|
||||
"We can append data width-wise, which will change some or all existing rows and potentially add new rows (based on the assumption that the data in the two datasets are aligned by row number).\n",
|
||||
"\n",
|
||||
"However, we cannot do this if a reference dataflow's schema clashes with the target dataflow's. Observe:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import auto_read_file"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')\n",
|
||||
"dflow_chicago.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import ExecutionError\n",
|
||||
"try:\n",
|
||||
" dflow_combined_by_column = dflow.append_columns([dflow_chicago])\n",
|
||||
" dflow_combined_by_column.head(5)\n",
|
||||
"except ExecutionError:\n",
|
||||
" print('Cannot append_columns with schema clash!')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As expected, we cannot call `append_columns` with reference dataflows whose schemas clash with the target dataflow's.\n",
|
||||
"\n",
|
||||
"We can make the call once we rename or drop the offending columns. In more complex scenarios, we could opt to skip or filter to make rows align before appending columns. Here we will choose to simply drop the clashing column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_combined_by_column = dflow.append_columns([dflow_chicago.drop_columns(['Ward'])])\n",
|
||||
"dflow_combined_by_column.head(5)"
|
||||
]
|
||||
},
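{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we could keep the clashing column under a different name instead of dropping it. A minimal sketch, assuming `rename_columns` accepts a dict mapping current column names to new names, and using a hypothetical name `Alderman Ward`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical alternative: rename the clashing 'Ward' column rather than dropping it\n",
"dflow_renamed = dflow.append_columns([dflow_chicago.rename_columns({'Ward': 'Alderman Ward'})])\n",
"dflow_renamed.head(5)"
]
},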
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice that the resultant schema has more columns in the first N records (N being the number of records in `dflow`, and the extra columns being the width of the schema of our reference dataflow, chicago, minus the `Ward` column). From the (N+1)th record onwards, we will only have a schema width matching that of the `Ward`-less chicago set.\n",
|
||||
"\n",
|
||||
"Why is this? As much as possible, the data from the reference dataflow(s) will be attached to existing rows in the target dataflow. If there are not enough rows in the target dataflow to attach to, we simply append them as new rows.\n",
|
||||
"\n",
|
||||
"Note that these are appends, not joins (for joins, see [Join](join.ipynb)), so the append may not be logically correct, but it will take effect as long as there are no schema clashes."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Ward-less data after we skip the first N rows\n",
|
||||
"dflow_len = dflow.row_count\n",
|
||||
"dflow_combined_by_column.skip(dflow_len).head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"append_rows\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## `append_rows(dataflows)`\n",
|
||||
"We can append data length-wise, which will only have the effect of adding new rows. No existing data will be changed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import auto_read_file"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_spring = auto_read_file(path='../data/crime-spring.csv')\n",
|
||||
"dflow_spring.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')\n",
|
||||
"dflow_chicago.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_combined_by_row = dflow.append_rows([dflow_chicago, dflow_spring])\n",
|
||||
"dflow_combined_by_row.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice that neither schema nor data has changed for the target dataflow.\n",
|
||||
"\n",
|
||||
"If we skip ahead, we will see the reference dataflows' data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# chicago data\n",
|
||||
"dflow_len = dflow.row_count\n",
|
||||
"dflow_combined_by_row.skip(dflow_len).head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# crimes spring data\n",
|
||||
"dflow_chicago_len = dflow_chicago.row_count\n",
|
||||
"dflow_combined_by_row.skip(dflow_len + dflow_chicago_len).head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
}
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Assertions\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Frequently, the data we work with during cleaning and preparation is just a subset of the total data we will need in production. It is also common to be working on a snapshot of a live dataset that is continuously updated and augmented.\n",
|
||||
"\n",
|
||||
"In these cases, some of the assumptions we make as part of our cleaning might turn out to be false. Columns that originally only contained numbers within a certain range might actually contain a wider range of values in later executions. These errors often result in either broken pipelines or bad data.\n",
|
||||
"\n",
|
||||
"Azure ML Data Prep supports creating assertions on data, which are evaluated as the pipeline is executed. These assertions enable us to verify that our assumptions on the data continue to be accurate and, when not, to handle failures in a clean way."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To demonstrate, we will load a dataset and then add some assertions based on what we can see in the first few rows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import auto_read_file\n",
|
||||
"\n",
|
||||
"dflow = auto_read_file('../data/crime-dirty.csv')\n",
|
||||
"dflow.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see there are latitude and longitude columns present in this dataset. By definition, these are constrained to specific ranges of values. We can assert that this is indeed the case so that if any records come through with invalid values, we detect them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import value\n",
|
||||
"\n",
|
||||
"dflow = dflow.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')\n",
|
||||
"dflow = dflow.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')\n",
|
||||
"dflow.keep_columns(['Latitude', 'Longitude']).get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Any assertion failures are represented as Errors in the resulting dataset. From the profile above, you can see that the Error Count for both of these columns is 1. We can use a filter to retrieve the error and see what value caused the assertion to fail."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import col\n",
|
||||
"\n",
|
||||
"dflow_error = dflow.filter(col('Latitude').is_error())\n",
|
||||
"error = dflow_error.head(10)['Latitude'][0]\n",
|
||||
"print(error.originalValue)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Our assertion failed because we were not removing missing values from our data. At this point, we have two options: we can go back and edit our code to avoid this error in the first place or we can resolve it now. In this case, we will just filter these out."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import LocalFileOutput\n",
|
||||
"dflow_clean = dflow.filter(~dflow['Latitude'].is_error())\n",
|
||||
"dflow_clean.get_profile()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"name": "python36",
"language": "python"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"authors": [
{
"name": "sihhu"
}
],
"language_info": {
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"name": "python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.6.4"
}
},
"nbformat": 4,
"cells": [
{
"metadata": {},
"source": [
""
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"# Assertions\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"Frequently, the data we work with while cleaning and preparing data is just a subset of the total data we will need to work with in production. It is also common to be working on a snapshot of a live dataset that is continuously updated and augmented.\n",
"\n",
"In these cases, some of the assumptions we make as part of our cleaning might turn out to be false. Columns that originally only contained numbers within a certain range might actually contain a wider range of values in later executions. These errors often result in either broken pipelines or bad data.\n",
"\n",
"Azure ML Data Prep supports creating assertions on data, which are evaluated as the pipeline is executed. These assertions enable us to verify that our assumptions on the data continue to be accurate and, when not, to handle failures in a clean way."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"To demonstrate, we will load a dataset and then add some assertions based on what we can see in the first few rows."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.dataprep import auto_read_file\n",
"\n",
"dflow = auto_read_file('../data/crime-dirty.csv')\n",
"dflow.get_profile()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"We can see there are latitude and longitude columns present in this dataset. By definition, these are constrained to specific ranges of values. We can assert that this is indeed the case so that if any records come through with invalid values, we detect them."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.dataprep import value\n",
"\n",
"dflow = dflow.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')\n",
"dflow = dflow.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')\n",
"dflow.keep_columns(['Latitude', 'Longitude']).get_profile()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Any assertion failures are represented as Errors in the resulting dataset. From the profile above, you can see that the Error Count for both of these columns is 1. We can use a filter to retrieve the error and see what value caused the assertion to fail."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.dataprep import col\n",
"\n",
"dflow_error = dflow.filter(col('Latitude').is_error())\n",
"error = dflow_error.head(10)['Latitude'][0]\n",
"print(error.originalValue)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Our assertion failed because we were not removing missing values from our data. At this point, we have two options: we can go back and edit our code to avoid this error in the first place or we can resolve it now. In this case, we will just filter these out."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.dataprep import LocalFileOutput\n",
"dflow_clean = dflow.filter(~dflow['Latitude'].is_error())\n",
"dflow_clean.get_profile()"
],
"cell_type": "code"
}
],
"nbformat_minor": 2
}
@@ -1,189 +1,189 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Auto Read File\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep has the ability to load many different kinds of files. The `auto_read_file` entry point can take any supported file (including Excel, JSON, and Parquet) and auto-detect how to parse it. It will also attempt to auto-detect the types of each column and apply type transformations to the columns it detects.\n",
"\n",
"The result will be a Dataflow object that has all the steps required to read the given file(s) and convert their columns to the predicted types. No parameters are required beyond the file path or `FileDataSource` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_auto = dprep.auto_read_file('../data/crime_multiple_separators.csv')\n",
"dflow_auto.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_auto1 = dprep.auto_read_file('../data/crime.xlsx')\n",
"dflow_auto1.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_auto2 = dprep.auto_read_file('../data/crime.parquet')\n",
"dflow_auto2.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the data, we can see that there are two empty columns on either side of the 'Completed' column.\n",
"If we compare the dataframe to a few rows from the original file:\n",
"```\n",
"ID |CaseNumber| |Completed|\n",
"10140490 |HY329907| |Y|\n",
"10139776 |HY329265| |Y|\n",
"```\n",
"We can see that the `|`'s have disappeared in the dataframe. This is because `|` is a very common separator character in CSV files, so `auto_read_file` guessed it was the column separator. For this data, we actually want the `|`'s to remain and instead use space as the column separator.\n",
"\n",
"To achieve this, we can use `detect_file_format`. It takes a file path or datasource object and gives back a `FileFormatBuilder`, which has learnt some information about the supplied data.\n",
"This is what `auto_read_file` uses behind the scenes to 'learn' the contents of the given file and determine how to parse it. With the `FileFormatBuilder` we can take advantage of the intelligent learning aspect of `auto_read_file` but have the chance to modify some of the learnt information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ffb = dprep.detect_file_format('../data/crime_multiple_separators.csv')\n",
"ffb_2 = dprep.detect_file_format('../data/crime.xlsx')\n",
"ffb_3 = dprep.detect_file_format('../data/crime_fixed_width_file.txt')\n",
"ffb_4 = dprep.detect_file_format('../data/json.json')\n",
"\n",
"print(ffb.file_format)\n",
"print(ffb_2.file_format)\n",
"print(ffb_3.file_format)\n",
"print(type(ffb_4.file_format))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After calling `detect_file_format` we get a `FileFormatBuilder` that has had `learn` called on it. This means the `file_format` attribute will be populated with a `<Parse|Read><type>Properties` object, which contains all the information that was learnt about the file. As we can see above, different file types have corresponding file formats detected.\n",
"Continuing with our delimited example, we can change any of these values and then call `ffb.to_dataflow()` to create a `Dataflow` that has the steps required to parse the datasource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ffb.file_format.separator = ' '\n",
"dflow = ffb.to_dataflow()\n",
"df = dflow.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is our desired dataframe with `|`'s included.\n",
"\n",
"If we refer back to the original data output by `auto_read_file`, the 'ID' column was also detected as numeric and converted to a number data type instead of remaining a string as in the data above.\n",
"We can perform type inference on our new dataflow using the `dataflow.builders` property. This property exposes different builders that can `learn` from a dataflow and `apply` the learning to produce a new dataflow, very similar to the pattern we used above for the `FileFormatBuilder`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ctb = dflow.builders.set_column_types()\n",
"ctb.learn()\n",
"ctb.conversion_candidates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After learning, `ctb.conversion_candidates` is populated with information about the inferred types for each column. It is possible for there to be multiple candidate types per column; in this example there is only one type for each column.\n",
"\n",
"The candidates look correct; we only want to convert `ID` to an integer column, so applying this `ColumnTypesBuilder` should result in a Dataflow with our columns converted to their respective types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_converted = ctb.to_dataflow()\n",
"\n",
"df_converted = dflow_converted.to_pandas_dataframe()\n",
"df_converted"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"name": "python36",
"language": "python"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"authors": [
{
"name": "sihhu"
}
],
"language_info": {
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"name": "python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.6.4"
}
},
"nbformat": 4,
"cells": [
{
"metadata": {},
"source": [
""
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"# Auto Read File\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import azureml.dataprep as dprep"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Data Prep has the ability to load many different kinds of files. The `auto_read_file` entry point can take any supported file (including Excel, JSON, and Parquet) and auto-detect how to parse it. It will also attempt to auto-detect the types of each column and apply type transformations to the columns it detects.\n",
"\n",
"The result will be a Dataflow object that has all the steps required to read the given file(s) and convert their columns to the predicted types. No parameters are required beyond the file path or `FileDataSource` object."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_auto = dprep.auto_read_file('../data/crime_multiple_separators.csv')\n",
"dflow_auto.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_auto1 = dprep.auto_read_file('../data/crime.xlsx')\n",
"dflow_auto1.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_auto2 = dprep.auto_read_file('../data/crime.parquet')\n",
"dflow_auto2.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Looking at the data, we can see that there are two empty columns on either side of the 'Completed' column.\n",
"If we compare the dataframe to a few rows from the original file:\n",
"```\n",
"ID |CaseNumber| |Completed|\n",
"10140490 |HY329907| |Y|\n",
"10139776 |HY329265| |Y|\n",
"```\n",
"We can see that the `|`'s have disappeared in the dataframe. This is because `|` is a very common separator character in CSV files, so `auto_read_file` guessed it was the column separator. For this data, we actually want the `|`'s to remain and instead use space as the column separator.\n",
"\n",
"To achieve this, we can use `detect_file_format`. It takes a file path or datasource object and gives back a `FileFormatBuilder`, which has learnt some information about the supplied data.\n",
"This is what `auto_read_file` uses behind the scenes to 'learn' the contents of the given file and determine how to parse it. With the `FileFormatBuilder` we can take advantage of the intelligent learning aspect of `auto_read_file` but have the chance to modify some of the learnt information."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"ffb = dprep.detect_file_format('../data/crime_multiple_separators.csv')\n",
"ffb_2 = dprep.detect_file_format('../data/crime.xlsx')\n",
"ffb_3 = dprep.detect_file_format('../data/crime_fixed_width_file.txt')\n",
"ffb_4 = dprep.detect_file_format('../data/json.json')\n",
"\n",
"print(ffb.file_format)\n",
"print(ffb_2.file_format)\n",
"print(ffb_3.file_format)\n",
"print(type(ffb_4.file_format))"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"After calling `detect_file_format` we get a `FileFormatBuilder` that has had `learn` called on it. This means the `file_format` attribute will be populated with a `<Parse|Read><type>Properties` object, which contains all the information that was learnt about the file. As we can see above, different file types have corresponding file formats detected.\n",
"Continuing with our delimited example, we can change any of these values and then call `ffb.to_dataflow()` to create a `Dataflow` that has the steps required to parse the datasource."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"ffb.file_format.separator = ' '\n",
"dflow = ffb.to_dataflow()\n",
"df = dflow.to_pandas_dataframe()\n",
"df"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"The result is our desired dataframe with `|`'s included.\n",
"\n",
"If we refer back to the original data output by `auto_read_file`, the 'ID' column was also detected as numeric and converted to a number data type instead of remaining a string as in the data above.\n",
"We can perform type inference on our new dataflow using the `dataflow.builders` property. This property exposes different builders that can `learn` from a dataflow and `apply` the learning to produce a new dataflow, very similar to the pattern we used above for the `FileFormatBuilder`."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"ctb = dflow.builders.set_column_types()\n",
"ctb.learn()\n",
"ctb.conversion_candidates"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"After learning, `ctb.conversion_candidates` is populated with information about the inferred types for each column. It is possible for there to be multiple candidate types per column; in this example there is only one type for each column.\n",
"\n",
"The candidates look correct; we only want to convert `ID` to an integer column, so applying this `ColumnTypesBuilder` should result in a Dataflow with our columns converted to their respective types."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_converted = ctb.to_dataflow()\n",
"\n",
"df_converted = dflow_converted.to_pandas_dataframe()\n",
"df_converted"
],
"cell_type": "code"
}
],
"nbformat_minor": 2
}
@@ -1,194 +1,194 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cache\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Dataflow can be cached as a file on your disk during a local run by calling `dflow_cached = dflow.cache(directory_path)`. Doing this will run all the steps in the Dataflow, `dflow`, and save the cached data to the specified `directory_path`. The returned Dataflow, `dflow_cached`, has a Caching Step added at the end. Any subsequent runs on the Dataflow `dflow_cached` will reuse the cached data, and the steps before the Caching Step will not be run again.\n",
"\n",
"Caching avoids running transforms multiple times, which can make local runs more efficient. Here are common places to use caching:\n",
"- after reading data from remote\n",
"- after expensive transforms, such as Sort\n",
"- after transforms that change the shape of data, such as Sampling, Filter and Summarize\n",
"\n",
"The Caching Step will be ignored during a scale-out run invoked by `to_spark_dataframe()`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will start by reading in a dataset and applying some transforms to the Dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow = dflow.take_sample(probability=0.2, seed=7)\n",
"dflow = dflow.sort_asc(columns='Primary Type')\n",
"dflow = dflow.keep_columns(['ID', 'Case Number', 'Date', 'Primary Type'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will choose a directory to store the cached data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))\n",
"cache_dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now call `dflow.cache(directory_path)` to cache the Dataflow to your directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_cached = dflow.cache(directory_path=cache_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will check the steps in `dflow_cached` to see that all of the previous steps were cached."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"[s.step_type for s in dflow_cached._get_steps()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check the data stored in the cache directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.listdir(cache_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running against `dflow_cached` will reuse the cached data and skip running all of the previous steps again."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_cached.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding additional steps to `dflow_cached` will also reuse the cached data and skip running the steps prior to the Caching Step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_cached_take = dflow_cached.take(10)\n",
"dflow_cached_skip = dflow_cached.skip(10).take(10)\n",
"\n",
"df_cached_take = dflow_cached_take.to_pandas_dataframe()\n",
"df_cached_skip = dflow_cached_skip.to_pandas_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# shutil.rmtree will clean up the cached data\n",
"import shutil\n",
"shutil.rmtree(path=cache_dir)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"name": "python36",
"language": "python"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"authors": [
{
"name": "sihhu"
}
],
"language_info": {
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"name": "python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.6.4"
}
},
"nbformat": 4,
"cells": [
{
"metadata": {},
"source": [
""
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"# Cache\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"A Dataflow can be cached as a file on your disk during a local run by calling `dflow_cached = dflow.cache(directory_path)`. Doing this will run all the steps in the Dataflow, `dflow`, and save the cached data to the specified `directory_path`. The returned Dataflow, `dflow_cached`, has a Caching Step added at the end. Any subsequent runs on the Dataflow `dflow_cached` will reuse the cached data, and the steps before the Caching Step will not be run again.\n",
"\n",
"Caching avoids running transforms multiple times, which can make local runs more efficient. Here are common places to use caching:\n",
"- after reading data from remote\n",
"- after expensive transforms, such as Sort\n",
"- after transforms that change the shape of data, such as Sampling, Filter and Summarize\n",
"\n",
"The Caching Step will be ignored during a scale-out run invoked by `to_spark_dataframe()`."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"We will start by reading in a dataset and applying some transforms to the Dataflow."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import azureml.dataprep as dprep\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow = dflow.take_sample(probability=0.2, seed=7)\n",
"dflow = dflow.sort_asc(columns='Primary Type')\n",
"dflow = dflow.keep_columns(['ID', 'Case Number', 'Date', 'Primary Type'])\n",
"dflow.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Next, we will choose a directory to store the cached data."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import os\n",
"from pathlib import Path\n",
"cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))\n",
"cache_dir"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"We will now call `dflow.cache(directory_path)` to cache the Dataflow to your directory."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_cached = dflow.cache(directory_path=cache_dir)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Now we will check the steps in `dflow_cached` to see that all of the previous steps were cached."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"[s.step_type for s in dflow_cached._get_steps()]"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"We can also check the data stored in the cache directory."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"os.listdir(cache_dir)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Running against `dflow_cached` will reuse the cached data and skip running all of the previous steps again."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dflow_cached.head(5)"
],
"cell_type": "code"
},
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Adding additional steps to `dflow_cached` will also reuse the cache data and skip running the steps prior to the Cache Step."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_cached_take = dflow_cached.take(10)\n",
|
||||
"dflow_cached_skip = dflow_cached.skip(10).take(10)\n",
|
||||
"\n",
|
||||
"df_cached_take = dflow_cached_take.to_pandas_dataframe()\n",
|
||||
"df_cached_skip = dflow_cached_skip.to_pandas_dataframe()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
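{
"metadata": {},
"source": [
"As a hedged sketch (it assumes a configured Spark environment, which this notebook does not set up), a scale-out run via `to_spark_dataframe()` would execute all steps at scale and ignore the Caching Step."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Hedged sketch: requires a Spark environment, so the call is left commented out.\n",
"# In a scale-out run like this, the Caching Step is ignored.\n",
"# df_spark = dflow_cached.to_spark_dataframe()"
],
"cell_type": "code"
},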
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"# shutil.rmtree will then clean up the cached data \n",
|
||||
"import shutil\n",
|
||||
"shutil.rmtree(path=cache_dir)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,473 +1,473 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Column Type Transforms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When consuming a data set, it is highly useful to know as much as possible about the data. Column types can help you understand more about each column, and enable type-specific transformations later. This provides much more insight than treating all data as strings.\n",
|
||||
"\n",
|
||||
"In this notebook, you will learn about:\n",
|
||||
"- [Built-in column types](#types)\n",
|
||||
"- How to:\n",
|
||||
" - [Convert to long (integer)](#long)\n",
|
||||
" - [Convert to double (floating point or decimal number)](#double)\n",
|
||||
" - [Convert to boolean](#boolean)\n",
|
||||
" - [Convert to datetime](#datetime)\n",
|
||||
"- [How to use `ColumnTypesBuilder` to get suggested column types and convert them](#builder)\n",
|
||||
"- [How to convert column type for multiple columns if types are known](#multiple-columns)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Set up"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dprep.read_csv('../data/crime-winter.csv')\n",
|
||||
"dflow = dflow.keep_columns(['Case Number', 'Date', 'IUCR', 'Arrest', 'Longitude', 'Latitude'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"types\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Built-in column types"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Currently, Data Prep supports the following column types: string, long (integer), double (floating point or decimal number), boolean, and datetime.\n",
|
||||
"\n",
|
||||
"In the previous step, a data set was read in as a Dataflow, with only a few interesting columns kept. We will use this Dataflow to explore column types throughout the notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"From the first few rows of the Dataflow, you can see that the columns contain different types of data. However, by looking at `dtypes`, you can see that `read_csv()` treats all columns as string columns.\n",
|
||||
"\n",
|
||||
"Note that `auto_read_file()` is a data ingestion function that infers column types. Learn more about it [here](./auto-read-file.ipynb)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.dtypes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"long\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to long (integer)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"IUCR\" column should only contain integers. You can call `to_long` to convert the column type of \"IUCR\" to `FieldType.INTEGER`. If you look at the data profile ([learn more about data profiles](./data-profile.ipynb)), you will see numeric metrics populated for that column such as mean, variance, quantiles, etc. This is helpful for understanding the shape and distribution of numeric data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion = dflow.to_long('IUCR')\n",
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"double\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to double (floating point or decimal number)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Latitude\" and \"Longitude\" columns should only contain decimal numbers. You can call `to_double` to convert the column type of \"Latitude\" and \"Longitude\" to `FieldType.DECIMAL`. In the data profile, you will see numeric metrics populated for these columns as well. Note that after converting the column types, you can see that there are missing values in these columns. Metrics like this can be helpful for noticing issues with the data set."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion = dflow_conversion.to_number(['Latitude', 'Longitude'])\n",
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"boolean\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to boolean"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Arrest\" column should only contain boolean values. You can call `to_bool` to convert the column type of \"Arrest\" to `FieldType.BOOLEAN`.\n",
|
||||
"\n",
|
||||
"The `to_bool` function allows you to specify which values should map to `True` and which values should map to `False`. To do so, you can provide those values in an array as parameters `true_values` and `false_values`. Additionally, you can specify whether all other values should become `True`, `False` or Error by using the `mismatch_as` parameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion.to_bool('Arrest', \n",
|
||||
" true_values=[1],\n",
|
||||
" false_values=[0],\n",
|
||||
" mismatch_as=dprep.MismatchAsOption.ASERROR).head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous conversion, all the values in the \"Arrest\" column became `DataPrepError`, because 'FALSE' didn't match any of the `false_values` nor any of the `true_values`, and all the unmatched values were set to become errors. Let's try the conversion again with different `false_values`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion = dflow_conversion.to_bool('Arrest',\n",
|
||||
" true_values=['1', 'TRUE'],\n",
|
||||
" false_values=['0', 'FALSE'],\n",
|
||||
" mismatch_as=dprep.MismatchAsOption.ASERROR)\n",
|
||||
"dflow_conversion.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This time, all the string values 'FALSE' have been successfully converted to the boolean value `False`. Take another look at the data profile."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"datetime\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Date\" column should only contain datetime values. You can convert its column type to `FieldType.DateTime` using the `to_datetime` function. Typically, datetime formats can be confusing or inconsistent. Next, we will show you all the tools that can help correctly converting the column to `DateTime`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the first example, directly call `to_datetime` with only the column name. Data Prep will inspect the data in this column and learn what format should be used for the conversion.\n",
|
||||
"\n",
|
||||
"Note that if there is data in the column that cannot be converted to datetime, an Error value will be created in that cell."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion_date = dflow_conversion.to_datetime('Date')\n",
|
||||
"dflow_conversion_date.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case, we can see that '1/10/2016 11:00' was converted using the format `%m/%d/%Y %H:%M`.\n",
|
||||
"\n",
|
||||
"The data in this column is actually somewhat ambiguous. Should the dates be 'October 1' or 'January 10'? The function `to_datetime` determines that both are possible, but defaults to month-first (US format).\n",
|
||||
"\n",
|
||||
"If the data was supposed to be day-first, you can customize the conversion."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_alternate_conversion = dflow_conversion.to_datetime('Date', date_time_formats=['%d/%m/%Y %H:%M'])\n",
|
||||
"dflow_alternate_conversion.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"builder\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using `ColumnTypesBuilder`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Data Prep can help you automatically detect what are the likely column types.\n",
|
||||
"\n",
|
||||
"You can call `dflow.builders.set_column_types()` to get a `ColumnTypesBuilder`. Then, calling `learn()` on it will trigger Data Prep to inspect the data in each column. As a result, you can see the suggested column types for each column (conversion candidates)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder = dflow.builders.set_column_types()\n",
|
||||
"builder.learn()\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case, Data Prep suggested the correct column types for \"Arrest\", \"Case Number\", \"Latitude\", and \"Longitude\".\n",
|
||||
"\n",
|
||||
"However, for \"Date\", it has suggested two possible date formats: month-first, or day-first. The ambiguity must be resolved before you complete the conversion. To use the month-first format, you can call `builder.ambiguous_date_conversions_keep_month_day()`. Otherwise, call `builder.ambiguous_date_conversions_keep_day_month()`. Note that if there were multiple datetime columns with ambiguous date conversions, calling one of these functions will apply the resolution to all of them.\n",
|
||||
"\n",
|
||||
"If you want to skip all the ambiguous date column conversions instead, you can call: `builder.ambiguous_date_conversions_drop()`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.ambiguous_date_conversions_keep_month_day()\n",
|
||||
"builder.conversion_candidates"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The conversion candidate for \"IUCR\" is currently `FieldType.INTEGER`. If you know that \"IUCR\" should be floating point (called `FieldType.DECIMAL`), you can tweak the builder to change the conversion candidate for that specific column. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.conversion_candidates['IUCR'] = dprep.FieldType.DECIMAL\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case we are happy with \"IUCR\" as `FieldType.INTEGER`. So we set it back. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.conversion_candidates['IUCR'] = dprep.FieldType.INTEGER\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you are happy with the conversion candidates, you can complete the conversion by calling `builder.to_dataflow()`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_converion_using_builder = builder.to_dataflow()\n",
|
||||
"dflow_converion_using_builder.head(5)"
|
||||
]
|
||||
},
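{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hedged check, you can call `dtypes` on the converted Dataflow to confirm that the builder's conversions were applied."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged check: dtypes should now reflect the converted column types\n",
"builder.to_dataflow().dtypes"
]
},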
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"multiple-columns\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Convert column types for multiple columns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you already know the column types, you can simply call `dflow.set_column_types()`. This function allows you to specify multiple columns, and the desired column type for each one. Here's how you can convert all five columns at once.\n",
|
||||
"\n",
|
||||
"Note that `set_column_types` only supports a subset of column type conversions. For example, we cannot specify the true/false values for a boolean conversion, so the results of this operation is incorrect for the \"Arrest\" column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_conversion_using_set = dflow.set_column_types({\n",
|
||||
" 'IUCR': dprep.FieldType.INTEGER,\n",
|
||||
" 'Latitude': dprep.FieldType.DECIMAL,\n",
|
||||
" 'Longitude': dprep.FieldType.DECIMAL,\n",
|
||||
" 'Arrest': dprep.FieldType.BOOLEAN,\n",
|
||||
" 'Date': (dprep.FieldType.DATE, ['%m/%d/%Y %H:%M']),\n",
|
||||
"})\n",
|
||||
"dflow_conversion_using_set.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Column Type Transforms\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When consuming a data set, it is highly useful to know as much as possible about the data. Column types can help you understand more about each column, and enable type-specific transformations later. This provides much more insight than treating all data as strings.\n",
|
||||
"\n",
|
||||
"In this notebook, you will learn about:\n",
|
||||
"- [Built-in column types](#types)\n",
|
||||
"- How to:\n",
|
||||
" - [Convert to long (integer)](#long)\n",
|
||||
" - [Convert to double (floating point or decimal number)](#double)\n",
|
||||
" - [Convert to boolean](#boolean)\n",
|
||||
" - [Convert to datetime](#datetime)\n",
|
||||
"- [How to use `ColumnTypesBuilder` to get suggested column types and convert them](#builder)\n",
|
||||
"- [How to convert column type for multiple columns if types are known](#multiple-columns)"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Set up"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dprep.read_csv('../data/crime-winter.csv')\n",
|
||||
"dflow = dflow.keep_columns(['Case Number', 'Date', 'IUCR', 'Arrest', 'Longitude', 'Latitude'])"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"types\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Built-in column types"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Currently, Data Prep supports the following column types: string, long (integer), double (floating point or decimal number), boolean, and datetime.\n",
|
||||
"\n",
|
||||
"In the previous step, a data set was read in as a Dataflow, with only a few interesting columns kept. We will use this Dataflow to explore column types throughout the notebook."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"From the first few rows of the Dataflow, you can see that the columns contain different types of data. However, by looking at `dtypes`, you can see that `read_csv()` treats all columns as string columns.\n",
|
||||
"\n",
|
||||
"Note that `auto_read_file()` is a data ingestion function that infers column types. Learn more about it [here](./auto-read-file.ipynb)."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow.dtypes"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"long\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to long (integer)"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"IUCR\" column should only contain integers. You can call `to_long` to convert the column type of \"IUCR\" to `FieldType.INTEGER`. If you look at the data profile ([learn more about data profiles](./data-profile.ipynb)), you will see numeric metrics populated for that column such as mean, variance, quantiles, etc. This is helpful for understanding the shape and distribution of numeric data."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion = dflow.to_long('IUCR')\n",
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"double\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to double (floating point or decimal number)"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Latitude\" and \"Longitude\" columns should only contain decimal numbers. You can call `to_double` to convert the column type of \"Latitude\" and \"Longitude\" to `FieldType.DECIMAL`. In the data profile, you will see numeric metrics populated for these columns as well. Note that after converting the column types, you can see that there are missing values in these columns. Metrics like this can be helpful for noticing issues with the data set."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion = dflow_conversion.to_number(['Latitude', 'Longitude'])\n",
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"boolean\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Converting to boolean"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Arrest\" column should only contain boolean values. You can call `to_bool` to convert the column type of \"Arrest\" to `FieldType.BOOLEAN`.\n",
|
||||
"\n",
|
||||
"The `to_bool` function allows you to specify which values should map to `True` and which values should map to `False`. To do so, you can provide those values in an array as parameters `true_values` and `false_values`. Additionally, you can specify whether all other values should become `True`, `False` or Error by using the `mismatch_as` parameter."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion.to_bool('Arrest', \n",
|
||||
" true_values=[1],\n",
|
||||
" false_values=[0],\n",
|
||||
" mismatch_as=dprep.MismatchAsOption.ASERROR).head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous conversion, all the values in the \"Arrest\" column became `DataPrepError`, because 'FALSE' didn't match any of the `false_values` nor any of the `true_values`, and all the unmatched values were set to become errors. Let's try the conversion again with different `false_values`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion = dflow_conversion.to_bool('Arrest',\n",
|
||||
" true_values=['1', 'TRUE'],\n",
|
||||
" false_values=['0', 'FALSE'],\n",
|
||||
" mismatch_as=dprep.MismatchAsOption.ASERROR)\n",
|
||||
"dflow_conversion.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This time, all the string values 'FALSE' have been successfully converted to the boolean value `False`. Take another look at the data profile."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile = dflow_conversion.get_profile()\n",
|
||||
"profile"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"datetime\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose the \"Date\" column should only contain datetime values. You can convert its column type to `FieldType.DateTime` using the `to_datetime` function. Typically, datetime formats can be confusing or inconsistent. Next, we will show you all the tools that can help correctly converting the column to `DateTime`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the first example, directly call `to_datetime` with only the column name. Data Prep will inspect the data in this column and learn what format should be used for the conversion.\n",
|
||||
"\n",
|
||||
"Note that if there is data in the column that cannot be converted to datetime, an Error value will be created in that cell."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion_date = dflow_conversion.to_datetime('Date')\n",
|
||||
"dflow_conversion_date.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case, we can see that '1/10/2016 11:00' was converted using the format `%m/%d/%Y %H:%M`.\n",
|
||||
"\n",
|
||||
"The data in this column is actually somewhat ambiguous. Should the dates be 'October 1' or 'January 10'? The function `to_datetime` determines that both are possible, but defaults to month-first (US format).\n",
|
||||
"\n",
|
||||
"If the data was supposed to be day-first, you can customize the conversion."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_alternate_conversion = dflow_conversion.to_datetime('Date', date_time_formats=['%d/%m/%Y %H:%M'])\n",
|
||||
"dflow_alternate_conversion.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"builder\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using `ColumnTypesBuilder`"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Data Prep can help you automatically detect what are the likely column types.\n",
|
||||
"\n",
|
||||
"You can call `dflow.builders.set_column_types()` to get a `ColumnTypesBuilder`. Then, calling `learn()` on it will trigger Data Prep to inspect the data in each column. As a result, you can see the suggested column types for each column (conversion candidates)."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder = dflow.builders.set_column_types()\n",
|
||||
"builder.learn()\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case, Data Prep suggested the correct column types for \"Arrest\", \"Case Number\", \"Latitude\", and \"Longitude\".\n",
|
||||
"\n",
|
||||
"However, for \"Date\", it has suggested two possible date formats: month-first, or day-first. The ambiguity must be resolved before you complete the conversion. To use the month-first format, you can call `builder.ambiguous_date_conversions_keep_month_day()`. Otherwise, call `builder.ambiguous_date_conversions_keep_day_month()`. Note that if there were multiple datetime columns with ambiguous date conversions, calling one of these functions will apply the resolution to all of them.\n",
|
||||
"\n",
|
||||
"If you want to skip all the ambiguous date column conversions instead, you can call: `builder.ambiguous_date_conversions_drop()`"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.ambiguous_date_conversions_keep_month_day()\n",
|
||||
"builder.conversion_candidates"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The conversion candidate for \"IUCR\" is currently `FieldType.INTEGER`. If you know that \"IUCR\" should be floating point (called `FieldType.DECIMAL`), you can tweak the builder to change the conversion candidate for that specific column. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.conversion_candidates['IUCR'] = dprep.FieldType.DECIMAL\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case we are happy with \"IUCR\" as `FieldType.INTEGER`. So we set it back. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.conversion_candidates['IUCR'] = dprep.FieldType.INTEGER\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you are happy with the conversion candidates, you can complete the conversion by calling `builder.to_dataflow()`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_converion_using_builder = builder.to_dataflow()\n",
|
||||
"dflow_converion_using_builder.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"multiple-columns\"></a>"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Convert column types for multiple columns"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you already know the column types, you can simply call `dflow.set_column_types()`. This function allows you to specify multiple columns, and the desired column type for each one. Here's how you can convert all five columns at once.\n",
|
||||
"\n",
|
||||
"Note that `set_column_types` only supports a subset of column type conversions. For example, we cannot specify the true/false values for a boolean conversion, so the results of this operation is incorrect for the \"Arrest\" column."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_conversion_using_set = dflow.set_column_types({\n",
|
||||
" 'IUCR': dprep.FieldType.INTEGER,\n",
|
||||
" 'Latitude': dprep.FieldType.DECIMAL,\n",
|
||||
" 'Longitude': dprep.FieldType.DECIMAL,\n",
|
||||
" 'Arrest': dprep.FieldType.BOOLEAN,\n",
|
||||
" 'Date': (dprep.FieldType.DATE, ['%m/%d/%Y %H:%M']),\n",
|
||||
"})\n",
|
||||
"dflow_conversion_using_set.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,231 +1,231 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Custom Python Transforms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There will be scenarios when the easiest thing for you to do is just to write some Python code. This SDK provides three extension points that you can use.\n",
|
||||
"\n",
|
||||
"1. New Script Column\n",
|
||||
"2. New Script Filter\n",
|
||||
"3. Transform Partition\n",
|
||||
"\n",
|
||||
"Each of these are supported in both the scale-up and the scale-out runtime. A key advantage of using these extension points is that you don't need to pull all of the data in order to create a dataframe. Your custom python code will be run just like other transforms, at scale, by partition, and typically in parallel."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initial data prep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We start by loading crime data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"col = dprep.col\n",
|
||||
"\n",
|
||||
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We trim the dataset down and keep only the columns we are interested in. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.keep_columns(['Case Number','Primary Type', 'Description', 'Latitude', 'Longitude'])\n",
|
||||
"dflow = dflow.replace_na(columns=['Latitude', 'Longitude'], custom_na_list='')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We look for null values using a filter. We found some, so now we'll look at a way to fill these missing values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.filter(col('Latitude').is_null()).head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Transform Partition"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to replace all null values with a 0, so we decide to use a handy pandas function. This code will be run by partition, not on all of the dataset at a time. This means that on a large dataset, this code may run in parallel as the runtime processes the data partition by partition."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pt_dflow = dflow\n",
|
||||
"dflow = pt_dflow.transform_partition(\"\"\"\n",
|
||||
"def transform(df, index):\n",
|
||||
" df['Latitude'].fillna('0',inplace=True)\n",
|
||||
" df['Longitude'].fillna('0',inplace=True)\n",
|
||||
" return df\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Transform Partition With File"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Being able to use any python code to manipulate your data as a pandas DataFrame is extremely useful for complex and specific data operations that DataPrep doesn't handle natively. Though the code isn't very testable unfortunately, it's just sitting inside a string.\n",
|
||||
"So to improve code testability and ease of script writing there is another transform_partiton interface that takes the path to a python script which must contain a function matching the 'transform' signature defined above.\n",
|
||||
"\n",
|
||||
"The `script_path` argument should be a relative path to ensure Dataflow portability. Here `map_func.py` contains the same code as in the previous example."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = pt_dflow.transform_partition_with_file('../data/map_func.py')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## New Script Column"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to create a new column that has both the latitude and longitude. We can achieve it easily using [Data Prep expression](./add-column-using-expression.ipynb), which is faster in execution. Alternatively, We can do this using Python code by using the `new_script_column()` method on the dataflow. Note that we use custom Python code here for demo purpose only. In practise, you should always use Data Prep native functions as a preferred method, and use custom Python code when the functionality is not available in Data Prep. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.new_script_column(new_column_name='coordinates', insert_after='Longitude', script=\"\"\"\n",
|
||||
"def newvalue(row):\n",
|
||||
" return '(' + row['Latitude'] + ', ' + row['Longitude'] + ')'\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## New Script Filter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we want to filter the dataset down to only the crimes that incurred over $300 in loss. We can build a Python expression that returns True if we want to keep the row, and False to drop the row."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.new_script_filter(\"\"\"\n",
|
||||
"def includerow(row):\n",
|
||||
" val = row['Description']\n",
|
||||
" return 'OVER $ 300' in val\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.8"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Custom Python Transforms\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There will be scenarios when the easiest thing for you to do is just to write some Python code. This SDK provides three extension points that you can use.\n",
|
||||
"\n",
|
||||
"1. New Script Column\n",
|
||||
"2. New Script Filter\n",
|
||||
"3. Transform Partition\n",
|
||||
"\n",
|
||||
"Each of these are supported in both the scale-up and the scale-out runtime. A key advantage of using these extension points is that you don't need to pull all of the data in order to create a dataframe. Your custom python code will be run just like other transforms, at scale, by partition, and typically in parallel."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initial data prep"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We start by loading crime data."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"col = dprep.col\n",
|
||||
"\n",
|
||||
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We trim the dataset down and keep only the columns we are interested in. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.keep_columns(['Case Number','Primary Type', 'Description', 'Latitude', 'Longitude'])\n",
|
||||
"dflow = dflow.replace_na(columns=['Latitude', 'Longitude'], custom_na_list='')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We look for null values using a filter. We found some, so now we'll look at a way to fill these missing values."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow.filter(col('Latitude').is_null()).head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Transform Partition"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to replace all null values with a 0, so we decide to use a handy pandas function. This code will be run by partition, not on all of the dataset at a time. This means that on a large dataset, this code may run in parallel as the runtime processes the data partition by partition."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"pt_dflow = dflow\n",
|
||||
"dflow = pt_dflow.transform_partition(\"\"\"\n",
|
||||
"def transform(df, index):\n",
|
||||
" df['Latitude'].fillna('0',inplace=True)\n",
|
||||
" df['Longitude'].fillna('0',inplace=True)\n",
|
||||
" return df\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Transform Partition With File"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Being able to use any python code to manipulate your data as a pandas DataFrame is extremely useful for complex and specific data operations that DataPrep doesn't handle natively. Though the code isn't very testable unfortunately, it's just sitting inside a string.\n",
|
||||
"So to improve code testability and ease of script writing there is another transform_partiton interface that takes the path to a python script which must contain a function matching the 'transform' signature defined above.\n",
|
||||
"\n",
|
||||
"The `script_path` argument should be a relative path to ensure Dataflow portability. Here `map_func.py` contains the same code as in the previous example."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = pt_dflow.transform_partition_with_file('../data/map_func.py')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## New Script Column"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to create a new column that has both the latitude and longitude. We can achieve it easily using [Data Prep expression](./add-column-using-expression.ipynb), which is faster in execution. Alternatively, We can do this using Python code by using the `new_script_column()` method on the dataflow. Note that we use custom Python code here for demo purpose only. In practise, you should always use Data Prep native functions as a preferred method, and use custom Python code when the functionality is not available in Data Prep. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.new_script_column(new_column_name='coordinates', insert_after='Longitude', script=\"\"\"\n",
|
||||
"def newvalue(row):\n",
|
||||
" return '(' + row['Latitude'] + ', ' + row['Longitude'] + ')'\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## New Script Filter"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we want to filter the dataset down to only the crimes that incurred over $300 in loss. We can build a Python expression that returns True if we want to keep the row, and False to drop the row."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.new_script_filter(\"\"\"\n",
|
||||
"def includerow(row):\n",
|
||||
" val = row['Description']\n",
|
||||
" return 'OVER $ 300' in val\n",
|
||||
"\"\"\")\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
File diff suppressed because it is too large
@@ -1,179 +1,179 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Data Profile\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
|
||||
"- Understand the input data.\n",
|
||||
"- Determine which columns might need further preparation.\n",
|
||||
"- Verify that data preparation operations produced the desired result.\n",
|
||||
"\n",
|
||||
"`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"\n",
|
||||
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
|
||||
"\n",
|
||||
"profile = dflow.get_profile()\n",
|
||||
"profile"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile.columns['Beat']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also extract and filter data from profiles by using list and dict comprehensions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"variances = [c.variance for c in profile.columns.values() if c.variance]\n",
|
||||
"variances"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"column_types = {c.name: c.type for c in profile.columns.values()}\n",
|
||||
"column_types"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile.columns['Primary Type'].value_counts"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Numeric ColumnProfiles include an estimated histogram of the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile.columns['District'].histogram"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To configure the number of bins in the histogram, you can pass an integer as the `number_of_histogram_bins` parameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile_more_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
|
||||
"profile_more_bins.columns['District'].histogram"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"profile.columns['X Coordinate'].type_counts"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Data Profile\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
|
||||
"- Understand the input data.\n",
|
||||
"- Determine which columns might need further preparation.\n",
|
||||
"- Verify that data preparation operations produced the desired result.\n",
|
||||
"\n",
|
||||
"`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"\n",
|
||||
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
|
||||
"\n",
|
||||
"profile = dflow.get_profile()\n",
|
||||
"profile"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile.columns['Beat']"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also extract and filter data from profiles by using list and dict comprehensions."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"variances = [c.variance for c in profile.columns.values() if c.variance]\n",
|
||||
"variances"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"column_types = {c.name: c.type for c in profile.columns.values()}\n",
|
||||
"column_types"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile.columns['Primary Type'].value_counts"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Numeric ColumnProfiles include an estimated histogram of the data."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile.columns['District'].histogram"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To configure the number of bins in the histogram, you can pass an integer as the `number_of_histogram_bins` parameter."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile_more_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
|
||||
"profile_more_bins.columns['District'].histogram"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"profile.columns['X Coordinate'].type_counts"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,215 +1,215 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Reading from and Writing to Datastores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A datastore is a reference that points to an Azure storage service like a blob container for example. It belongs to a workspace and a workspace can have many datastores.\n",
|
||||
"\n",
|
||||
"A data path points to a path on the underlying Azure storage service the datastore references. For example, given a datastore named `blob` that points to an Azure blob container, a data path can point to `/test/data/titanic.csv` in the blob container."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Read data from Datastore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Data Prep supports reading data from a `Datastore` or a `DataPath` or a `DataReference`. \n",
|
||||
"\n",
|
||||
"Passing in a datastore into all the `read_*` methods of Data Prep will result in reading everything in the underlying Azure storage service. To read a specific folder or file in the underlying storage, you have to pass in a data reference."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Workspace, Datastore\n",
"from azureml.data.datapath import DataPath\n",
"\n",
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, get or create a workspace. Feel free to replace `subscription_id`, `resource_group`, and `workspace_name` with your own values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = '35f16a99-532a-4a47-9e93-00305f6c40f2'\n",
"resource_group = 'DataStoreTest'\n",
"workspace_name = 'dataprep-centraleuap'\n",
"\n",
"workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"workspace.datastores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now read a crime data set from the datastore. If you are using your own workspace, the `crime0-10.csv` file will not be there by default; you will have to upload the data to the datastore yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = Datastore(workspace=workspace, name='dataprep_blob')\n",
"dflow = dprep.read_csv(path=datastore.path('crime0-10.csv'))\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also read from an Azure SQL database. To do that, first get an Azure SQL database datastore instance and pass it to Data Prep for reading."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = Datastore(workspace=workspace, name='test_sql')\n",
"dflow_sql = dprep.read_sql(data_source=datastore, query='SELECT * FROM team')\n",
"dflow_sql.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also read from a PostgreSQL database. To do that, first get a PostgreSQL database datastore instance and pass it to Data Prep for reading."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = Datastore(workspace=workspace, name='postgre_test')\n",
"dflow_sql = dprep.read_postgresql(data_source=datastore, query='SELECT * FROM public.people')\n",
"dflow_sql.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write data to Datastore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write a dataflow to a datastore. The code below writes the file you read in earlier to a folder in the datastore."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dest_datastore = Datastore(workspace, 'dataprep_blob_key')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.write_to_csv(directory_path=dest_datastore.path('output/crime0-10')).run_local()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can read files from the `dataprep_adls` datastore, which references an Azure Data Lake Store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = Datastore(workspace=workspace, name='dataprep_adls')\n",
"dflow_adls = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/input/crime0-10.csv'))\n",
"dflow_adls.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,187 +1,187 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Derive Column By Example\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the more advanced tools in Data Prep is the ability to derive columns by providing examples of desired results and letting Data Prep generate code to achieve the intended derivation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv(path = '../data/crime-spring.csv')\n",
"df = dflow.head(5)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this is a fairly simple file, but let's assume that we need to join it with a dataset where date and time come in the format 'Apr 4, 2016 | 10PM-12AM'.\n",
"\n",
"Let's wrangle the data into the shape we need."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder = dflow.builders.derive_column_by_example(source_columns = ['Date'], new_column_name = 'date_timerange')\n",
"builder.add_example(source_data = df.iloc[0], example_value = 'Apr 4, 2016 10PM-12AM')\n",
"builder.preview() # will preview top 10 rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code above first creates a builder for the derived column by providing an array of source columns to consider ('Date') and a name for the new column to be added.\n",
"\n",
"Then, we provide the first example by passing in the first row (index 0) of the DataFrame printed above and giving the expected value for the derived column.\n",
"\n",
"Finally, we call `builder.preview()` and observe the derived column next to the source column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything looks good here. However, we just noticed that it's not quite what we wanted: we forgot to separate the date and the time range with '|' to produce the format we need.\n",
"\n",
"To fix that, we will add another example. This time, instead of passing in a row from the preview, we construct a dictionary mapping column names to values for the `source_data` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.add_example(source_data = {'Date': '4/15/2016 10:00'}, example_value = 'Apr 15, 2016 | 10AM-12PM')\n",
"builder.preview()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This clearly had negative effects: now the only rows that have any values in the derived column are the ones that exactly match the examples we have provided.\n",
"\n",
"Let's look at the examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"examples = builder.list_examples()\n",
"examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we can see that we have provided inconsistent examples. To fix the issue, we need to replace the first example with a correct one (including '|' between the date and the time).\n",
"\n",
"We can achieve this by deleting the incorrect examples (by passing in either `example_row` from the examples DataFrame or the `example_id` value) and then adding the modified examples back."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.delete_example(example_id = -1)\n",
"builder.add_example(examples.iloc[0], 'Apr 4, 2016 | 10PM-12AM')\n",
"builder.preview()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now this looks correct, and we can finally call `to_dataflow()` on the builder, which returns a dataflow with the desired derived column added."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = builder.to_dataflow()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow.to_pandas_dataframe()\n",
"df"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,118 +1,118 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# External References\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to opening existing Dataflows in code and modifying them, it is also possible to create and persist Dataflows that reference another Dataflow that has been persisted to a .dprep file. In this case, executing the referencing Dataflow loads and executes the referenced Dataflow dynamically, and then executes its own steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To demonstrate, we will create a Dataflow that loads and transforms some data. After that, we will persist this Dataflow to disk. To learn more about saving and opening .dprep files, see [Opening and Saving Dataflows](./open-save-dataflows.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"import tempfile\n",
"import os\n",
"\n",
"dflow = dprep.auto_read_file('../data/crime.txt')\n",
"dflow = dflow.drop_errors(['Column7', 'Column8', 'Column9'], dprep.ColumnRelationship.ANY)\n",
"dflow_path = os.path.join(tempfile.gettempdir(), 'package.dprep')\n",
"dflow.save(dflow_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a .dprep file, we can create a new Dataflow that references it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_new = dprep.Dataflow.reference(dprep.ExternalReference(dflow_path))\n",
"dflow_new.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When executed, the new Dataflow returns the same results as the one we saved to the .dprep file. Since the reference is resolved on execution, updating the referenced Dataflow makes the changes visible the next time the referencing Dataflow is executed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.take(5)\n",
"dflow.save(dflow_path)\n",
"\n",
"dflow_new.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, even though we did not modify `dflow_new`, it now returns only 5 records, because the referenced Dataflow was updated with the result of `dflow.take(5)`."
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,220 +1,220 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filtering\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure ML Data Prep can filter out columns or rows using `Dataflow.drop_columns` or `Dataflow.filter`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# initial setup\n",
"import azureml.dataprep as dprep\n",
"from datetime import datetime\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter columns, use `Dataflow.drop_columns`. This method takes a list of columns to drop or a more complex argument called `ColumnSelector`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering columns with a list of strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, `drop_columns` takes a list of strings. Each string must exactly match the name of the column to drop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.drop_columns(['ID', 'Location Description', 'Ward', 'Community Area', 'FBI Code'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering columns with regex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, a `ColumnSelector` can be used to drop columns that match a regular expression. In this example, we drop all the columns that match the expression `Column*|.*longitud|.*latitude`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.drop_columns(dprep.ColumnSelector('Column*|.*longitud|.*latitude', True, True))\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter rows, use `Dataflow.filter`. This method takes an `Expression` as an argument and returns a new dataflow containing only the rows for which the expression evaluates to `True`. Expressions are built by indexing the `Dataflow` with a column name (`dflow['myColumn']`) and using the standard comparison operators (`>`, `<`, `>=`, `<=`, `==`, `!=`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering rows with simple expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Index into the Dataflow with a column name as a string (`dflow['column_name']`) and combine it with one of the standard operators `>`, `<`, `>=`, `<=`, `==`, `!=` to build an expression such as `dflow['District'] > 9`. Finally, pass the expression to `Dataflow.filter`.\n",
"\n",
"In this example, `dflow.filter(dflow['District'] > 9)` returns a new dataflow with the rows in which the value of \"District\" is greater than 9.\n",
"\n",
"*Note that \"District\" is first converted to numeric, which allows us to build an expression comparing it against other numeric values.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.to_number(['District'])\n",
"dflow = dflow.filter(dflow['District'] > 9)\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering rows with complex expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter using complex expressions, combine one or more simple expressions with the operators `&`, `|`, and `~`. Note that these operators have lower precedence than the comparison operators, so you need parentheses to group clauses together.\n",
"\n",
"In this example, `Dataflow.filter` returns a new dataflow with the rows in which \"Primary Type\" equals 'DECEPTIVE PRACTICE' and \"District\" is greater than or equal to 10."
]
},
{
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.to_number(['District'])\n",
|
||||
"dflow = dflow.filter((dflow['Primary Type'] == 'DECEPTIVE PRACTICE') & (dflow['District'] >= 10))\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It is also possible to filter rows combining more than one expression builder to create a nested expression.\n",
|
||||
"\n",
|
||||
"*Note that `'Date'` and `'Updated On'` are first converted to datetime, which allows us to build an expression comparing it against other datetime values.*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.to_datetime(['Date', 'Updated On'], ['%Y-%m-%d %H:%M:%S'])\n",
|
||||
"dflow = dflow.to_number(['District', 'Y Coordinate'])\n",
|
||||
"comparison_date = datetime(2016,4,13)\n",
|
||||
"dflow = dflow.filter(\n",
|
||||
" ((dflow['Date'] > comparison_date) | (dflow['Updated On'] > comparison_date))\n",
|
||||
" | ((dflow['Y Coordinate'] > 1900000) & (dflow['District'] > 10.0)))\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Filtering\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Azure ML Data Prep has the ability to filter out columns or rows using `Dataflow.drop_columns` or `Dataflow.filter`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"# initial set up\n",
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"from datetime import datetime\n",
|
||||
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Filtering columns"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To filter columns, use `Dataflow.drop_columns`. This method takes a list of columns to drop or a more complex argument called `ColumnSelector`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filtering columns with list of strings"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this example, `drop_columns` takes a list of strings. Each string should exactly match the desired column to drop."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.drop_columns(['ID', 'Location Description', 'Ward', 'Community Area', 'FBI Code'])\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filtering columns with regex"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Alternatively, a `ColumnSelector` can be used to drop columns that match a regex expression. In this example, we drop all the columns that match the expression `Column*|.*longitud|.*latitude`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.drop_columns(dprep.ColumnSelector('Column*|.*longitud|.*latitude', True, True))\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Filtering rows"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To filter rows, use `DataFlow.filter`. This method takes an `Expression` as an argument, and returns a new dataflow with the rows in which the expression evaluates to `True`. Expressions are built by indexing the `Dataflow` with a column name (`dataflow['myColumn']`) and regular operators (`>`, `<`, `>=`, `<=`, `==`, `!=`)."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filtering rows with simple expressions"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Index into the Dataflow specifying the column name as a string argument `dataflow['column_name']` and in combination with one of the following standard operators `>, <, >=, <=, ==, !=`, build an expression such as `dataflow['District'] > 9`. Finally, pass the built expression into the `Dataflow.filter` function.\n",
|
||||
"\n",
|
||||
"In this example, `dataflow.filter(dataflow['District'] > 9)` returns a new dataflow with the rows in which the value of \"District\" is greater than '10' \n",
|
||||
"\n",
|
||||
"*Note that \"District\" is first converted to numeric, which allows us to build an expression comparing it against other numeric values.*"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.to_number(['District'])\n",
|
||||
"dflow = dflow.filter(dflow['District'] > 9)\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filtering rows with complex expressions"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To filter using complex expressions, combine one or more simple expressions with the operators `&`, `|`, and `~`. Please note that the precedence of these operators is lower than that of the comparison operators; therefore, you'll need to use parentheses to group clauses together. \n",
|
||||
"\n",
|
||||
"In this example, `Dataflow.filter` returns a new dataflow with the rows in which \"Primary Type\" equals 'DECEPTIVE PRACTICE' and \"District\" is greater than or equal to '10'."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.to_number(['District'])\n",
|
||||
"dflow = dflow.filter((dflow['Primary Type'] == 'DECEPTIVE PRACTICE') & (dflow['District'] >= 10))\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It is also possible to filter rows combining more than one expression builder to create a nested expression.\n",
|
||||
"\n",
|
||||
"*Note that `'Date'` and `'Updated On'` are first converted to datetime, which allows us to build an expression comparing it against other datetime values.*"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.to_datetime(['Date', 'Updated On'], ['%Y-%m-%d %H:%M:%S'])\n",
|
||||
"dflow = dflow.to_number(['District', 'Y Coordinate'])\n",
|
||||
"comparison_date = datetime(2016,4,13)\n",
|
||||
"dflow = dflow.filter(\n",
|
||||
" ((dflow['Date'] > comparison_date) | (dflow['Updated On'] > comparison_date))\n",
|
||||
" | ((dflow['Y Coordinate'] > 1900000) & (dflow['District'] > 10.0)))\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,211 +1,211 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Fuzzy Grouping\n",
|
||||
"\n",
|
||||
"Unprepared data often represents the same entity with multiple values; examples include different spellings, varying capitalizations, and abbreviations. This is common when working with data gathered from multiple sources or through human input. One way to canonicalize and reconcile these variants is to use Data Prep's fuzzy_group_column (also known as \"text clustering\") functionality.\n",
|
||||
"\n",
|
||||
"Data Prep inspects a column to determine clusters of similar values. A new column is added in which clustered values are replaced with the canonical value of its cluster, thus significantly reducing the number of distinct values. You can control the degree of similarity required for values to be clustered together, override canonical form, and set clusters if automatic clustering did not provide the desired results.\n",
|
||||
"\n",
|
||||
"Let's explore the capabilities of `fuzzy_group_column` by first reading in a dataset and inspecting it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dprep.read_json(path='../data/json.json')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can see above, the column `inspections.business.city` contains several forms of the city name \"San Francisco\".\n",
|
||||
"Let's add a column with values replaced by the automatically detected canonical form. To do so call fuzzy_group_column() on an existing Dataflow:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_clean = dflow.fuzzy_group_column(source_column='inspections.business.city',\n",
|
||||
" new_column_name='city_grouped',\n",
|
||||
" similarity_threshold=0.8,\n",
|
||||
" similarity_score_column_name='similarity_score')\n",
|
||||
"dflow_clean.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The arguments `source_column` and `new_column_name` are required, whereas the others are optional.\n",
|
||||
"If `similarity_threshold` is provided, it will be used to control the required similarity level for the values to be grouped together.\n",
|
||||
"If `similarity_score_column_name` is provided, a second new column will be added to show similarity score between every pair of original and canonical values.\n",
|
||||
"\n",
|
||||
"In the resulting data set, you can see that all the different variations of representing \"San Francisco\" in the data were normalized to the same string, \"San Francisco\".\n",
|
||||
"\n",
|
||||
"But what if you want more control over what gets grouped, what doesn't, and what the canonical value should be? \n",
|
||||
"\n",
|
||||
"To get more control over grouping, canonical values, and exceptions, you need to use the `FuzzyGroupBuilder` class.\n",
|
||||
"Let's see what it has to offer below:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder = dflow.builders.fuzzy_group_column(source_column='inspections.business.city',\n",
|
||||
" new_column_name='city_grouped',\n",
|
||||
" similarity_threshold=0.8,\n",
|
||||
" similarity_score_column_name='similarity_score')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# calling learn() to get fuzzy groups\n",
|
||||
"builder.learn()\n",
|
||||
"builder.groups"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here you can see that `fuzzy_group_column` detected one group with four values that all map to \"San Francisco\" as the canonical value.\n",
|
||||
"You can see the effects of changing the similarity threshold next:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.similarity_threshold = 0.9\n",
|
||||
"builder.learn()\n",
|
||||
"builder.groups"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that you are using a similarity threshold of `0.9`, two distinct groups of values are generated.\n",
|
||||
"\n",
|
||||
"Let's tweak some of the detected groups before completing the builder and getting back the Dataflow with the resulting fuzzy grouped column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.similarity_threshold = 0.8\n",
|
||||
"builder.learn()\n",
|
||||
"groups = builder.groups\n",
|
||||
"groups"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# change the canonical value for the first group\n",
|
||||
"groups[0]['canonicalValue'] = 'SANFRAN'\n",
|
||||
"duplicates = groups[0]['duplicates']\n",
|
||||
"# remove the last duplicate value from the cluster\n",
|
||||
"duplicates = duplicates[:-1]\n",
|
||||
"# assign modified duplicate array back\n",
|
||||
"groups[0]['duplicates'] = duplicates\n",
|
||||
"# assign modified groups back to builder\n",
|
||||
"builder.groups = groups\n",
|
||||
"builder.groups"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here, the canonical value is modified to be used for the single fuzzy group and removed 'S.F.' from this group's duplicates list.\n",
|
||||
"\n",
|
||||
"You can mutate the copy of the `groups` list from the builder (be careful to keep the structure of objects inside this list). After getting the desired groups in the list, you can update the builder with it.\n",
|
||||
"\n",
|
||||
"Now you can get a dataflow with the FuzzyGroup step in it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_clean = builder.to_dataflow()\n",
|
||||
"\n",
|
||||
"df = dflow_clean.to_pandas_dataframe()\n",
|
||||
"df"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Fuzzy Grouping\n",
|
||||
"\n",
|
||||
"Unprepared data often represents the same entity with multiple values; examples include different spellings, varying capitalizations, and abbreviations. This is common when working with data gathered from multiple sources or through human input. One way to canonicalize and reconcile these variants is to use Data Prep's fuzzy_group_column (also known as \"text clustering\") functionality.\n",
|
||||
"\n",
|
||||
"Data Prep inspects a column to determine clusters of similar values. A new column is added in which clustered values are replaced with the canonical value of its cluster, thus significantly reducing the number of distinct values. You can control the degree of similarity required for values to be clustered together, override canonical form, and set clusters if automatic clustering did not provide the desired results.\n",
|
||||
"\n",
|
||||
"Let's explore the capabilities of `fuzzy_group_column` by first reading in a dataset and inspecting it."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dprep.read_json(path='../data/json.json')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can see above, the column `inspections.business.city` contains several forms of the city name \"San Francisco\".\n",
|
||||
"Let's add a column with values replaced by the automatically detected canonical form. To do so call fuzzy_group_column() on an existing Dataflow:"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_clean = dflow.fuzzy_group_column(source_column='inspections.business.city',\n",
|
||||
" new_column_name='city_grouped',\n",
|
||||
" similarity_threshold=0.8,\n",
|
||||
" similarity_score_column_name='similarity_score')\n",
|
||||
"dflow_clean.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The arguments `source_column` and `new_column_name` are required, whereas the others are optional.\n",
|
||||
"If `similarity_threshold` is provided, it will be used to control the required similarity level for the values to be grouped together.\n",
|
||||
"If `similarity_score_column_name` is provided, a second new column will be added to show similarity score between every pair of original and canonical values.\n",
|
||||
"\n",
|
||||
"In the resulting data set, you can see that all the different variations of representing \"San Francisco\" in the data were normalized to the same string, \"San Francisco\".\n",
|
||||
"\n",
|
||||
"But what if you want more control over what gets grouped, what doesn't, and what the canonical value should be? \n",
|
||||
"\n",
|
||||
"To get more control over grouping, canonical values, and exceptions, you need to use the `FuzzyGroupBuilder` class.\n",
|
||||
"Let's see what it has to offer below:"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder = dflow.builders.fuzzy_group_column(source_column='inspections.business.city',\n",
|
||||
" new_column_name='city_grouped',\n",
|
||||
" similarity_threshold=0.8,\n",
|
||||
" similarity_score_column_name='similarity_score')"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"# calling learn() to get fuzzy groups\n",
|
||||
"builder.learn()\n",
|
||||
"builder.groups"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here you can see that `fuzzy_group_column` detected one group with four values that all map to \"San Francisco\" as the canonical value.\n",
|
||||
"You can see the effects of changing the similarity threshold next:"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.similarity_threshold = 0.9\n",
|
||||
"builder.learn()\n",
|
||||
"builder.groups"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that you are using a similarity threshold of `0.9`, two distinct groups of values are generated.\n",
|
||||
"\n",
|
||||
"Let's tweak some of the detected groups before completing the builder and getting back the Dataflow with the resulting fuzzy grouped column."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.similarity_threshold = 0.8\n",
|
||||
"builder.learn()\n",
|
||||
"groups = builder.groups\n",
|
||||
"groups"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"# change the canonical value for the first group\n",
|
||||
"groups[0]['canonicalValue'] = 'SANFRAN'\n",
|
||||
"duplicates = groups[0]['duplicates']\n",
|
||||
"# remove the last duplicate value from the cluster\n",
|
||||
"duplicates = duplicates[:-1]\n",
|
||||
"# assign modified duplicate array back\n",
|
||||
"groups[0]['duplicates'] = duplicates\n",
|
||||
"# assign modified groups back to builder\n",
|
||||
"builder.groups = groups\n",
|
||||
"builder.groups"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here, the canonical value is modified to be used for the single fuzzy group and removed 'S.F.' from this group's duplicates list.\n",
|
||||
"\n",
|
||||
"You can mutate the copy of the `groups` list from the builder (be careful to keep the structure of objects inside this list). After getting the desired groups in the list, you can update the builder with it.\n",
|
||||
"\n",
|
||||
"Now you can get a dataflow with the FuzzyGroup step in it."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_clean = builder.to_dataflow()\n",
|
||||
"\n",
|
||||
"df = dflow_clean.to_pandas_dataframe()\n",
|
||||
"df"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,147 +1,147 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Impute missing values\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Azure ML Data Prep has the ability to impute missing values in specified columns. In this case, we will attempt to impute the missing _Latitude_ and _Longitude_ values in the input data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# loading input data\n",
|
||||
"dflow = dprep.read_csv(path= '../data/crime-spring.csv')\n",
|
||||
"dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n",
|
||||
"dflow = dflow.to_number(['Latitude', 'Longitude'])\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The third record from input data has _Latitude_ and _Longitude_ missing. To impute those missing values, we can use `ImputeMissingValuesBuilder` to learn a fixed program which imputes the columns with either a calculated `MIN`, `MAX` or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values will be imputed by group with `MIN`, `MAX` and `MEAN` calculated per group."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Firstly, let us quickly see check the `MEAN` value of _Latitude_ column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_mean = dflow.summarize(group_by_columns=['Arrest'],\n",
|
||||
" summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',\n",
|
||||
" summary_column_name='Latitude_MEAN',\n",
|
||||
" summary_function=dprep.SummaryFunction.MEAN)])\n",
|
||||
"dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n",
|
||||
"dflow_mean.head(1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `MEAN` value of _Latitude_ looks good. So we will impute _Latitude_ with it. As for `Longitude`, we will impute it using `42` based on external knowledge."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# impute with MEAN\n",
|
||||
"impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n",
|
||||
" impute_function=dprep.ReplaceValueFunction.MEAN)\n",
|
||||
"# impute with custom value 42\n",
|
||||
"impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n",
|
||||
" custom_impute_value=42)\n",
|
||||
"# get instance of ImputeMissingValuesBuilder\n",
|
||||
"impute_builder = dflow.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],\n",
|
||||
" group_by_columns=['Arrest'])\n",
|
||||
"# call learn() to learn a fixed program to impute missing values\n",
|
||||
"impute_builder.learn()\n",
|
||||
"# call to_dataflow() to get a dataflow with impute step added\n",
|
||||
"dflow_imputed = impute_builder.to_dataflow()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# check impute result\n",
|
||||
"dflow_imputed.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As the result above, the missing _Latitude_ has been imputed with the `MEAN` value of `Arrest=='false'` group, and the missing _Longitude_ has been imputed with `42`."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
}
|
||||
@@ -1,265 +1,265 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Join\n",
"\n",
"In Data Prep you can easily join two Dataflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, get the left side of the data into a shape that is ready for the join."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the first Dataflow and derive desired key column\n",
"dflow_left = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/BostonWeather.csv')\n",
"dflow_left = dflow_left.derive_column_by_example(source_columns='DATE', new_column_name='date_timerange',\n",
" example_data=[('11/11/2015 0:54', 'Nov 11, 2015 | 12AM-2AM'),\n",
" ('2/1/2015 0:54', 'Feb 1, 2015 | 12AM-2AM'),\n",
" ('1/29/2015 20:54', 'Jan 29, 2015 | 8PM-10PM')])\n",
"dflow_left = dflow_left.drop_columns(['DATE'])\n",
"\n",
"# convert types and summarize data\n",
"dflow_left = dflow_left.set_column_types(type_conversions={'HOURLYDRYBULBTEMPF': dprep.TypeConverter(dprep.FieldType.DECIMAL)})\n",
"dflow_left = dflow_left.filter(expression=~dflow_left['HOURLYDRYBULBTEMPF'].is_error())\n",
"dflow_left = dflow_left.summarize(group_by_columns=['date_timerange'],summary_columns=[dprep.SummaryColumnsValue('HOURLYDRYBULBTEMPF', dprep.api.engineapi.typedefinitions.SummaryFunction.MEAN, 'HOURLYDRYBULBTEMPF_Mean')] )\n",
"\n",
"# cache the result so the steps above are not executed every time we pull on the data\n",
"import os\n",
"from pathlib import Path\n",
"cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))\n",
"dflow_left.cache(directory_path=cache_dir)\n",
"dflow_left.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's prepare the data for the right side of the join."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the second Dataflow and desired key column\n",
"dflow_right = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/bike-share/*-hubway-tripdata.csv')\n",
"dflow_right = dflow_right.keep_columns(['starttime', 'start station id'])\n",
"dflow_right = dflow_right.derive_column_by_example(source_columns='starttime', new_column_name='l_date_timerange',\n",
" example_data=[('2015-01-01 00:21:44', 'Jan 1, 2015 | 12AM-2AM')])\n",
"dflow_right = dflow_right.drop_columns('starttime')\n",
"\n",
"# cache the results\n",
"dflow_right.cache(directory_path=cache_dir)\n",
"dflow_right.head(5)"
]
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are three ways you can join two Dataflows in Data Prep:\n",
"1. Create a `JoinBuilder` object for interactive join configuration.\n",
"2. Call `join()` on one of the Dataflows and pass in the other along with all other arguments.\n",
"3. Call the `Dataflow.join()` method and pass in two Dataflows along with all other arguments.\n",
"\n",
"We will explore the builder object, as it simplifies the determination of correct arguments."
]
},
|
||||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# construct a builder for joining dflow_left with dflow_right\n",
"join_builder = dflow_left.builders.join(right_dataflow=dflow_right, left_column_prefix='l', right_column_prefix='r')\n",
"\n",
"join_builder"
]
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far the builder has no properties set except default values.\n",
"From here you can set each of the options and preview its effect on the join result, or use Data Prep to determine some of them.\n",
"\n",
"Let's start by determining appropriate column prefixes for the left and right sides of the join, and the lists of columns that would not conflict and therefore don't need to be prefixed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"join_builder.detect_column_info()\n",
"join_builder"
]
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that Data Prep has performed a pull on both Dataflows to determine the column names in them. Given that `dflow_right` already had a column starting with `l_`, a new prefix was generated that does not collide with any column names already present.\n",
"Additionally, columns in each Dataflow that won't conflict during the join remain unprefixed.\n",
"This approach to column naming is crucial for the join's robustness to schema changes in the data. Let's say that at some point in the future the data consumed by the left Dataflow also contains an `l_date_timerange` column.\n",
"Configured as above, the join will still run as expected and the new column will be prefixed with `l2_`, ensuring that if the column `l_date_timerange` was consumed by some other future transformation, it remains unaffected.\n",
"\n",
"Note: `KEY_generated` is appended to both lists and is reserved for Data Prep use in case Autojoin is performed.\n",
"\n",
"### Autojoin\n",
"Autojoin is a Data Prep feature that determines suitable join arguments given data on both sides. In some cases Autojoin can even derive a key column from a number of available columns in the data.\n",
"Here is how you can use Autojoin:"
]
},
|
||||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate join suggestions\n",
"join_builder.generate_suggested_join()\n",
"\n",
"# list generated suggestions\n",
"join_builder.list_join_suggestions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's select the first suggestion and preview the result of the join."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# apply first suggestion\n",
"join_builder.apply_suggestion(0)\n",
"\n",
"join_builder.preview(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, get our new joined Dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_autojoined = join_builder.to_dataflow().drop_columns(['l_date_timerange'])"
]
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Joining two Dataflows without pulling the data\n",
"\n",
"If you don't want to pull on the data and already know what the join should look like, you can always use the `join` method on the Dataflow."
]
},
|
||||
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_joined = dprep.Dataflow.join(left_dataflow=dflow_left,\n",
" right_dataflow=dflow_right,\n",
" join_key_pairs=[('date_timerange', 'l_date_timerange')],\n",
" left_column_prefix='l2_',\n",
" right_column_prefix='r_')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_joined.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_joined = dflow_joined.filter(expression=dflow_joined['r_start station id'] == '67')\n",
"df = dflow_joined.to_pandas_dataframe()\n",
"df"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
|
||||
@@ -1,168 +1,168 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Label Encoder\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep has the ability to encode labels with values between 0 and (number of classes - 1) using `label_encode`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"from datetime import datetime\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use `label_encode` from a Dataflow, simply specify the source column and the new column name. `label_encode` will figure out all the distinct values or classes in the source column, and it will return a new Dataflow with a new column containing the labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.label_encode(source_column='Primary Type', new_column_name='Primary Type Label')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To have more control over the encoded labels, create a builder with `dataflow.builders.label_encode`.\n",
"The builder allows you to preview and modify the encoded labels before generating a new Dataflow with the results.\n",
"To get started, create a builder object with `dataflow.builders.label_encode` specifying the source column and the new column name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder = dflow.builders.label_encode(source_column='Location Description', new_column_name='Location Description Label')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate the encoded labels, call the `learn` method on the builder object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.learn()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check the result, access the generated labels through the property `encoded_labels`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.encoded_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To modify the generated results, just assign a new value to `encoded_labels`. The following example adds a missing label not found in the sample data. `builder.encoded_labels` is saved into a variable `encoded_labels`, modified, and assigned back to `builder.encoded_labels`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"encoded_labels = builder.encoded_labels\n",
"encoded_labels['TOWNHOUSE'] = 6\n",
"\n",
"builder.encoded_labels = encoded_labels\n",
"builder.encoded_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the desired results are achieved, call `builder.to_dataflow` to get the new Dataflow with the encoded labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataflow = builder.to_dataflow()\n",
"dataflow.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
|
||||
@@ -1,239 +1,239 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Min-Max Scaler\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The min-max scaler scales all values in a column to a desired range (typically [0, 1]). This is also known as feature scaling or unity-based normalization. Min-max scaling is commonly used to normalize numeric columns in a data set for machine learning algorithms."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"First, load a data set containing information about crime in Chicago. Keep only a few columns."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
|
||||
"dflow = dflow.keep_columns(columns=['ID', 'District', 'FBI Code'])\n",
|
||||
"dflow = dflow.to_number(columns=['District', 'FBI Code'])\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Using `get_profile()`, you can see the shape of the numeric columns such as the minimum, maximum, count, and number of error values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To apply min-max scaling, call the function `min_max_scaler` on the Dataflow and specify the column name. This will trigger a full data scan over the column to determine the min and max values and perform the scaling. Note that the min and max values of the column are preserved at this point. If the same dataflow steps are performed over a different dataset, the min-max scaler must be re-executed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_district = dflow.min_max_scale(column='District')\n",
|
||||
"dflow_district.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Look at the data profile to see that the \"District\" column is now scaled; the min is 0 and the max is 1. Any error values and missing values from the source column are preserved."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_district.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also specify a custom range for the scaling. Instead of [0, 1], let's choose [-10, 10]."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_district_range = dflow.min_max_scale(column='District', range_min=-10, range_max=10)\n",
|
||||
"dflow_district_range.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In some cases, you may want to manually provide the min and max of the data in the source column. For example, you may want to avoid a full data scan because the dataset is large and we already know the min and max. You can provide the known min and max to the `min_max_scaler` function. The column will be scaled using the provided values. For example, if you want to scale the `FBI Code` column with 6 (`data_min`) becoming 0 (`range_min`), the program will scan the data to get `data_max`, which will become 1 (`range_max`)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_fbi = dflow.min_max_scale(column='FBI Code', data_min=6)\n",
|
||||
"dflow_fbi.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using a Min-Max Scaler builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For more flexibility when constructing the arguments for the min-max scaling, you can use a Min-Max Scaler builder."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder = dflow.builders.min_max_scale(column='District')\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Calling `builder.learn()` will trigger a full data scan to see what `data_min` and `data_max` are. You can choose whether to use these values or set custom values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.learn()\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you want to provide custom values for any of the arguments, you can update the builder object."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.range_max = 10\n",
|
||||
"builder.data_min = 6\n",
|
||||
"builder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When you are satisfied with the arguments, you will call `builder.to_dataflow()` to get the result. Note that the min and max values of the source column is preserved by the builder at this point. If you need to get the true `data_min` and `data_max` values again, you will need to set those arguments on the builder to `None` and then call `builder.learn()` again."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_builder = builder.to_dataflow()\n",
|
||||
"dflow_builder.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Min-Max Scaler\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The min-max scaler scales all values in a column to a desired range (typically [0, 1]). This is also known as feature scaling or unity-based normalization. Min-max scaling is commonly used to normalize numeric columns in a data set for machine learning algorithms."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"First, load a data set containing information about crime in Chicago. Keep only a few columns."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
|
||||
"dflow = dflow.keep_columns(columns=['ID', 'District', 'FBI Code'])\n",
|
||||
"dflow = dflow.to_number(columns=['District', 'FBI Code'])\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Using `get_profile()`, you can see the shape of the numeric columns such as the minimum, maximum, count, and number of error values."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow.get_profile()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To apply min-max scaling, call the function `min_max_scaler` on the Dataflow and specify the column name. This will trigger a full data scan over the column to determine the min and max values and perform the scaling. Note that the min and max values of the column are preserved at this point. If the same dataflow steps are performed over a different dataset, the min-max scaler must be re-executed."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_district = dflow.min_max_scale(column='District')\n",
|
||||
"dflow_district.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Look at the data profile to see that the \"District\" column is now scaled; the min is 0 and the max is 1. Any error values and missing values from the source column are preserved."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_district.get_profile()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also specify a custom range for the scaling. Instead of [0, 1], let's choose [-10, 10]."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_district_range = dflow.min_max_scale(column='District', range_min=-10, range_max=10)\n",
|
||||
"dflow_district_range.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In some cases, you may want to manually provide the min and max of the data in the source column. For example, you may want to avoid a full data scan because the dataset is large and we already know the min and max. You can provide the known min and max to the `min_max_scaler` function. The column will be scaled using the provided values. For example, if you want to scale the `FBI Code` column with 6 (`data_min`) becoming 0 (`range_min`), the program will scan the data to get `data_max`, which will become 1 (`range_max`)."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_fbi = dflow.min_max_scale(column='FBI Code', data_min=6)\n",
|
||||
"dflow_fbi.get_profile()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using a Min-Max Scaler builder"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For more flexibility when constructing the arguments for the min-max scaling, you can use a Min-Max Scaler builder."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder = dflow.builders.min_max_scale(column='District')\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Calling `builder.learn()` will trigger a full data scan to see what `data_min` and `data_max` are. You can choose whether to use these values or set custom values."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.learn()\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you want to provide custom values for any of the arguments, you can update the builder object."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.range_max = 10\n",
|
||||
"builder.data_min = 6\n",
|
||||
"builder"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When you are satisfied with the arguments, you will call `builder.to_dataflow()` to get the result. Note that the min and max values of the source column is preserved by the builder at this point. If you need to get the true `data_min` and `data_max` values again, you will need to set those arguments on the builder to `None` and then call `builder.learn()` again."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_builder = builder.to_dataflow()\n",
|
||||
"dflow_builder.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,179 +1,179 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# One Hot Encoder\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Azure ML Data Prep has the ability to perform one hot encoding on a selected column using `one_hot_encode`. The result Dataflow will have a new binary column for each categorical label encountered in the selected column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To use `one_hot_encode` from a Dataflow, simply specify the source column. `one_hot_encode` will figure out all the distinct values or categorical labels in the source column using the current data, and it will return a new Dataflow with a new binary column for each categorical label. Note that the categorical labels are remembered in the Dataflow step."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_result = dflow.one_hot_encode(source_column='Location Description')\n",
|
||||
"dflow_result.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"By default, all the new columns will use the `source_column` name as a prefix. However, if you would like to specify your own prefix, simply pass a `prefix` string as a second parameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_result = dflow.one_hot_encode(source_column='Location Description', prefix='LOCATION_')\n",
|
||||
"dflow_result.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To have more control over the categorical labels, create a builder using `dataflow.builders.one_hot_encode`. The builder allows to preview and modify the categorical labels before generating a new Dataflow with the results."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder = dflow.builders.one_hot_encode(source_column='Location Description', prefix='LOCATION_')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To generate the categorical labels, call the `learn` method on the builder object:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.learn()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To preview the categorical labels, simply access them through the property `categorical_labels` on the builder object:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.categorical_labels"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To modify the generated `categorical_labels`, assign a new value to `categorical_labels` or modify the existing one. The following example adds a missing label not found on the sample data to `categorical_labels`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.categorical_labels.append('TOWNHOUSE')\n",
|
||||
"builder.categorical_labels"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once the desired results are achieved, call `builder.to_dataflow` to get the new Dataflow with the encoded labels."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_result = builder.to_dataflow()\n",
|
||||
"dflow_result.head(5)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# One Hot Encoder\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Azure ML Data Prep has the ability to perform one hot encoding on a selected column using `one_hot_encode`. The result Dataflow will have a new binary column for each categorical label encountered in the selected column."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To use `one_hot_encode` from a Dataflow, simply specify the source column. `one_hot_encode` will figure out all the distinct values or categorical labels in the source column using the current data, and it will return a new Dataflow with a new binary column for each categorical label. Note that the categorical labels are remembered in the Dataflow step."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_result = dflow.one_hot_encode(source_column='Location Description')\n",
|
||||
"dflow_result.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"By default, all the new columns will use the `source_column` name as a prefix. However, if you would like to specify your own prefix, simply pass a `prefix` string as a second parameter."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_result = dflow.one_hot_encode(source_column='Location Description', prefix='LOCATION_')\n",
|
||||
"dflow_result.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To have more control over the categorical labels, create a builder using `dataflow.builders.one_hot_encode`. The builder allows to preview and modify the categorical labels before generating a new Dataflow with the results."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder = dflow.builders.one_hot_encode(source_column='Location Description', prefix='LOCATION_')"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To generate the categorical labels, call the `learn` method on the builder object:"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.learn()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To preview the categorical labels, simply access them through the property `categorical_labels` on the builder object:"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.categorical_labels"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To modify the generated `categorical_labels`, assign a new value to `categorical_labels` or modify the existing one. The following example adds a missing label not found on the sample data to `categorical_labels`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.categorical_labels.append('TOWNHOUSE')\n",
|
||||
"builder.categorical_labels"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once the desired results are achieved, call `builder.to_dataflow` to get the new Dataflow with the encoded labels."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_result = builder.to_dataflow()\n",
|
||||
"dflow_result.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,171 +1,171 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Opening and Saving Dataflows\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you have built a Dataflow, you can save it to a `.dprep` file. This persists all of the information in your Dataflow including steps you've added, examples and programs from by-example steps, computed aggregations, etc.\n",
|
||||
"\n",
|
||||
"You can also open `.dprep` files to access any Dataflows you have previously persisted."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Open\n",
|
||||
"\n",
|
||||
"Use the `open()` method of the Dataflow class to load existing `.dprep` files."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"dflow_path = os.path.join(os.getcwd(), '..', 'data', 'crime.dprep')\n",
|
||||
"print(dflow_path)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import Dataflow"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = Dataflow.open(dflow_path)\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Edit\n",
|
||||
"\n",
|
||||
"After a Dataflow is loaded, it can be further edited as needed. In this example, a filter is added."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.dataprep import col"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.filter(col('Description') != 'SIMPLE')\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Save\n",
|
||||
"\n",
|
||||
"Use the `save()` method of the Dataflow class to write out the `.dprep` file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import tempfile\n",
|
||||
"temp_dir = tempfile._get_default_tempdir()\n",
|
||||
"temp_file_name = next(tempfile._get_candidate_names())\n",
|
||||
"temp_dflow_path = os.path.join(temp_dir, temp_file_name + '.dprep')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.save(temp_dflow_path)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Round-trip\n",
|
||||
"\n",
|
||||
"This illustrates the ability to load the edited Dataflow back in and use it, in this case to get a pandas DataFrame."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_to_open = Dataflow.open(temp_dflow_path)\n",
|
||||
"df = dflow_to_open.to_pandas_dataframe()\n",
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"if os.path.isfile(temp_dflow_path):\n",
|
||||
" os.remove(temp_dflow_path)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
]
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Opening and Saving Dataflows\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you have built a Dataflow, you can save it to a `.dprep` file. This persists all of the information in your Dataflow including steps you've added, examples and programs from by-example steps, computed aggregations, etc.\n",
|
||||
"\n",
|
||||
"You can also open `.dprep` files to access any Dataflows you have previously persisted."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Open\n",
|
||||
"\n",
|
||||
"Use the `open()` method of the Dataflow class to load existing `.dprep` files."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import os\n",
|
||||
"dflow_path = os.path.join(os.getcwd(), '..', 'data', 'crime.dprep')\n",
|
||||
"print(dflow_path)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"from azureml.dataprep import Dataflow"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = Dataflow.open(dflow_path)\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Edit\n",
|
||||
"\n",
|
||||
"After a Dataflow is loaded, it can be further edited as needed. In this example, a filter is added."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"from azureml.dataprep import col"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dflow.filter(col('Description') != 'SIMPLE')\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Save\n",
|
||||
"\n",
|
||||
"Use the `save()` method of the Dataflow class to write out the `.dprep` file."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import tempfile\n",
|
||||
"temp_dir = tempfile._get_default_tempdir()\n",
|
||||
"temp_file_name = next(tempfile._get_candidate_names())\n",
|
||||
"temp_dflow_path = os.path.join(temp_dir, temp_file_name + '.dprep')"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow.save(temp_dflow_path)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Round-trip\n",
|
||||
"\n",
|
||||
"This illustrates the ability to load the edited Dataflow back in and use it, in this case to get a pandas DataFrame."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_to_open = Dataflow.open(temp_dflow_path)\n",
|
||||
"df = dflow_to_open.to_pandas_dataframe()\n",
|
||||
"df"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"if os.path.isfile(temp_dflow_path):\n",
|
||||
" os.remove(temp_dflow_path)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,91 +1,91 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quantile Transformation\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can apply a quantile transformation to a numeric column, mapping the data to a normal or uniform distribution. Values larger than the learned boundaries are clipped to those boundaries when the transformation is applied.\n",
"\n",
"Let's load a sample of the median income of California households in different suburbs from the 1990 census data. From the data profile, we can see that the minimum and maximum values are 0.9946 and 15, respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.read_csv(path='../data/median_income.csv').set_column_types(type_conversions={\n",
"    'median_income': dprep.TypeConverter(dprep.FieldType.DECIMAL)\n",
"})\n",
"dflow.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now apply quantile transformation to `median_income` and see how that affects the data. We will apply it twice: once mapping the data to a Uniform(0, 1) distribution, and once mapping it to a Normal(0, 1) distribution.\n",
"\n",
"From the data profile, we can see that the min and max of the uniform median income are strictly between 0 and 1, and the mean and standard deviation of the normal median income are close to 0 and 1, respectively.\n",
"\n",
"*Note: for the normal distribution, the values at the ends are clipped, since the 0th and 100th percentiles are -Inf and Inf, respectively.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_uniform', quantiles_count=5)\n",
"dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_normal',\n",
"                                 quantiles_count=5, output_distribution=\"Normal\")\n",
"dflow.get_profile()"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
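The idea behind the transformation above can be sketched in plain Python (a hypothetical illustration, not the SDK's implementation): rank each value against learned quantile boundaries to get a uniform score in (0, 1), then optionally push that score through the inverse normal CDF, clipping the ends so the normal output stays finite.

```python
from statistics import NormalDist

def quantile_transform(values, quantiles_count=5, output_distribution="Uniform"):
    """Map values to ~Uniform(0, 1) via their empirical quantile rank, optionally
    then through the inverse normal CDF for ~Normal(0, 1). Illustrative only."""
    ordered = sorted(values)
    n = len(ordered)
    # Learn quantile boundaries at evenly spaced probabilities.
    probs = [i / quantiles_count for i in range(quantiles_count + 1)]
    boundaries = [ordered[min(int(p * (n - 1)), n - 1)] for p in probs]

    def to_uniform(x):
        # Clip to the learned range, then interpolate between surrounding boundaries.
        x = min(max(x, boundaries[0]), boundaries[-1])
        for i in range(quantiles_count):
            lo, hi = boundaries[i], boundaries[i + 1]
            if x <= hi:
                frac = 0.0 if hi == lo else (x - lo) / (hi - lo)
                return (i + frac) / quantiles_count
        return 1.0

    eps = 1e-7  # clip the ends: inv_cdf(0) and inv_cdf(1) are -Inf / Inf
    uniform = [min(max(to_uniform(v), eps), 1 - eps) for v in values]
    if output_distribution == "Normal":
        return [NormalDist().inv_cdf(u) for u in uniform]
    return uniform
```

This mirrors the note in the notebook: the uniform output is strictly inside (0, 1), and the extreme percentiles are clipped before the inverse normal CDF is applied.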
@@ -1,145 +1,145 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Split\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure ML Data Prep can split a data set into two parts. When training a machine learning model, it is often desirable to train the model on one subset of the data, then validate the model on a different subset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `random_split(percentage, seed=None)` function in Data Prep takes a Dataflow and randomly splits it into two distinct subsets, sized approximately by the percentage specified."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `random_split` will receive different seeds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To demonstrate, you can go through the following example. First, read the first 10,000 lines from a file. Since the contents of the file don't matter, just the first two columns are kept for a simple example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv').take(10000)\n",
"dflow = dflow.keep_columns(['ID', 'Date'])\n",
"profile = dflow.get_profile()\n",
"print('Row count: %d' % (profile.columns['ID'].count))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, call `random_split` with the percentage set to 10% (the actual split ratio will be an approximation of `percentage`). Looking at the row count of the first returned Dataflow, you should see that `dflow_test` has approximately 1,000 rows (10% of 10,000)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(dflow_test, dflow_train) = dflow.random_split(percentage=0.1)\n",
"profile_test = dflow_test.get_profile()\n",
"print('Row count of \"test\": %d' % (profile_test.columns['ID'].count))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now take a look at the row count of the second returned Dataflow. The row counts of `dflow_test` and `dflow_train` sum to exactly 10,000, because `random_split` produces two subsets that together make up the original Dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile_train = dflow_train.get_profile()\n",
"print('Row count of \"train\": %d' % (profile_train.columns['ID'].count))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To specify a fixed seed, simply provide it to the `random_split` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(dflow_test, dflow_train) = dflow.random_split(percentage=0.1, seed=12345)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
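The split semantics described in this notebook (an exact partition whose subset sizes are only approximately the requested percentage, reproducible given a seed) can be sketched with a per-row Bernoulli draw. This is a hypothetical illustration of the behavior, not the azureml.dataprep algorithm:

```python
import random

def random_split(rows, percentage, seed=None):
    """Randomly partition rows into (test, train); test gets ~percentage of rows.
    Illustrative sketch only -- not the azureml.dataprep implementation."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    test, train = [], []
    for row in rows:
        # Each row independently lands in 'test' with the given probability,
        # so test + train always partition the input exactly.
        (test if rng.random() < percentage else train).append(row)
    return test, train
```

With a fixed seed the same split comes back on every call; without one, each call may place rows differently while still partitioning the input exactly.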
@@ -1,130 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replace DataSource Reference\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common practice when preparing data is to build up a script or set of cleaning operations on a smaller example file locally. This is quicker and easier than dealing with large amounts of data initially.\n",
"\n",
"After building a Dataflow that performs the desired steps, it's time to run it against the larger dataset, which may be stored in the cloud, or locally in a different file. This is where we can use `Dataflow.replace_datasource` to get a Dataflow identical to the one built on the small data, but referencing the newly specified DataSource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
"df = dflow.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have the first 10 rows of a dataset called 'Crime'. The original dataset is over 100 MB (admittedly not that large a dataset, but this is just an example).\n",
"\n",
"We'll perform a few cleaning operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_dropped = dflow.drop_columns(['Location', 'Updated On', 'X Coordinate', 'Y Coordinate', 'Description'])\n",
"sctb = dflow_dropped.builders.set_column_types()\n",
"sctb.learn(inference_arguments=dprep.InferenceArguments(day_first=False))\n",
"dflow_typed = sctb.to_dataflow()\n",
"dflow_typed.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a Dataflow with all our desired steps, we're ready to run against the 'full' dataset stored in Azure Blob storage.\n",
"All we need to do is pass the BlobDataSource into `replace_datasource`, and we'll get back an identical Dataflow with the new DataSource substituted in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_replaced = dflow_typed.replace_datasource(dprep.BlobDataSource('https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dflow_replaced` will now pull data from the 168 MB (729,734 rows) version of crime0.csv stored in Azure Blob storage!\n",
"\n",
"NOTE: Dataflows can also be created by referencing a different Dataflow. Instead of `replace_datasource`, there is a corresponding `replace_reference` method for that case.\n",
"\n",
"We should be careful now, since pulling all that data down into a pandas DataFrame isn't an ideal way to inspect the result of our Dataflow. Instead, to see that our steps are being applied to all the new data, we can add a `take_sample` step, which selects records at random (based on a given probability) to be returned.\n",
"\n",
"The probability below takes the ~730,000 rows down to a more inspectable ~73, though the number will vary each time `to_pandas_dataframe()` is run, since records are randomly selected based on the probability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_random_sample = dflow_replaced.take_sample(probability=0.0001)\n",
"sample = dflow_random_sample.to_pandas_dataframe()\n",
"sample"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
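The `take_sample` arithmetic in the notebook above (probability 0.0001 over ~730,000 rows yielding ~73) is just the expected value of independent per-row selection. A minimal sketch of that sampling model, purely for illustration and not the azureml.dataprep implementation:

```python
import random

def take_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability.
    Illustrative sketch; actual sample size varies around len(rows) * probability."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]
```

Because each row is an independent draw, the sample size follows a binomial distribution: the expected count is `len(rows) * probability`, with run-to-run variation on the order of its square root, which is why the notebook says the number differs each time `to_pandas_dataframe()` is run.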
@@ -1,239 +1,239 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replace, Fill, Error\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use the methods in this notebook to change values in your dataset.\n",
"\n",
"* <a href='#replace'>replace</a> - use this method to replace a value with another value. You can also use it to replace null with a value, or a value with null.\n",
"* <a href='#error'>error</a> - use this method to replace a value with an error.\n",
"* <a href='#fill_nulls'>fill_nulls</a> - this method lets you fill all nulls in a column with a certain value.\n",
"* <a href='#fill_errors'>fill_errors</a> - this method lets you fill all errors in a column with a certain value."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.to_datetime('Date', ['%m/%d/%Y %H:%M'])\n",
"dflow = dflow.to_number(['IUCR', 'District', 'FBI Code'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Replace <a id='replace'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### String\n",
"Use `replace` to swap a string value with another string value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.replace('Primary Type', 'THEFT', 'STOLEN')\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `replace` to remove a certain string value from the column, replacing it with null. Note that pandas shows null values as None."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.replace('Primary Type', 'DECEPTIVE PRACTICE', None)\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numeric\n",
"Use `replace` to swap a numeric value with another numeric value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.replace('District', 5, 1)\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Date\n",
"Use `replace` to swap in a new date for an existing date in the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime, timezone\n",
"dflow = dflow.replace('Date',\n",
"                      datetime(2016, 4, 15, 9, 0, tzinfo=timezone.utc),\n",
"                      datetime(2018, 7, 4, 0, 0, tzinfo=timezone.utc))\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Error <a id='error'></a>\n",
"\n",
"The `error` method lets you create Error values. You pass it the value that you want to find, along with the error code to use in any Errors created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.error('IUCR', 890, 'Invalid value')\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fill Nulls <a id='fill_nulls'></a>\n",
"\n",
"Use the `fill_nulls` method to replace all null values in columns with another value. This is similar to pandas' `fillna()` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.fill_nulls('Primary Type', 'N/A')\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Fill Errors <a id='fill_errors'></a>\n",
|
||||
"\n",
|
||||
"Use the `fill_errors` method to replace all error values in columns with another value."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.fill_errors('IUCR', -1)\n",
|
||||
"head = dflow.head(5)\n",
|
||||
"head"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
}
@@ -1,140 +1,140 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Providing Secrets\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, secrets are only persisted for the lifetime of the engine process. Even if the dataflow is saved to a file, the secrets are not persisted in the dprep file. If you start a new session (that is, a new engine process), load a dataflow, and want to run it, you will need to call `use_secrets` to register the required secrets for execution; otherwise the execution will fail because the required secrets are not available.\n",
    "\n",
    "In this notebook, we will:\n",
    "1. Load a previously saved dataflow\n",
    "2. Call `get_missing_secrets` to determine which secrets are missing\n",
    "3. Call `use_secrets` with the missing secrets to register them with the engine for this session\n",
    "4. Call `head` to see a preview of the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep\n",
    "\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the previously saved dataflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow = dprep.Dataflow.open(file_path='../data/secrets.dprep')\n",
    "dflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can call `get_missing_secrets` to see which required secrets are missing in the engine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow.get_missing_secrets()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can now read the secrets from an environment variable, put them in a secrets dictionary, and call `use_secrets` with that dictionary. This registers the secrets in the engine so you don't need to provide them again in this session.\n",
    "\n",
    "_Note: It is a bad practice to have secrets in files that will be checked into source control._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sas = os.environ['SCENARIOS_SECRETS']\n",
    "secrets = {\n",
    "    'https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv': sas\n",
    "}\n",
    "dflow.use_secrets(secrets=secrets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can now call `head` without passing in `secrets`, and the engine will execute successfully. Here is a preview of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow.head(5)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "sihhu"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -1,164 +1,164 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Types\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some string values can be recognized as semantic types. For example, email addresses, US zip codes, or IP addresses have specific formats that can be recognized and then split in specific ways.\n",
    "\n",
    "When getting a DataProfile, you can optionally ask to collect counts of values recognized as semantic types. [`Dataflow.get_profile()`](./data-profile.ipynb) executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile. Semantic type counts can be included in the data profile by calling `get_profile` with the `include_stype_counts` argument set to `True`.\n",
    "\n",
    "The `stype_counts` property of the DataProfile will then include entries for columns where some semantic types were recognized for some values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep\n",
    "dflow = dprep.read_json(path='../data/json.json')\n",
    "\n",
    "profile = dflow.get_profile(include_stype_counts=True)\n",
    "\n",
    "print(\"row count: \" + str(profile.row_count))\n",
    "profile.stype_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see all the supported semantic types, you can examine the `SType` enumeration. More types will be added over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[t.name for t in dprep.SType]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can filter the found semantic types down to just those where all non-empty values matched. The `DataProfile.stype_counts` property gives a list of semantic type counts for each column where at least some matches were found. Those lists are in descending order of count, so here we consider only the first entry in each list, as that is the one with the highest count of matching values.\n",
    "\n",
    "In this example, the column `inspections.business.postal_code` looks to be a US zip code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stypes_counts = profile.stype_counts\n",
    "all_match = [\n",
    "    (column, stypes_counts[column][0].stype)\n",
    "    for column in stypes_counts\n",
    "    if profile.row_count - profile.columns[column].empty_count == stypes_counts[column][0].count\n",
    "]\n",
    "all_match"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use semantic types to compute new columns. The new columns are the values split up into elements, or canonicalized.\n",
    "\n",
    "Here we reduce our data down to just the `postal` column so we can better see what a `split_stype` operation can do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow_postal = dflow.keep_columns(['inspections.business.postal_code']).rename_columns({'inspections.business.postal_code': 'postal'})\n",
    "dflow_postal.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `SType.ZIPCODE`, values are split into their basic five-digit zip code and the plus-four add-on of the Zip+4 format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow_split = dflow_postal.split_stype('postal', dprep.SType.ZIPCODE)\n",
    "dflow_split.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`split_stype` also lets you specify the fields of the stype to use and the names of the new columns. For example, if you just needed to strip the plus-four from our zip codes, you could use this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow_no_plus4 = dflow_postal.split_stype('postal', dprep.SType.ZIPCODE, ['zip'], ['zipNoPlus4'])\n",
    "dflow_no_plus4.head(5)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "sihhu"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -1,220 +1,220 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Split column by example\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DataPrep also offers you an easy way to split a column into multiple columns.\n",
    "The `SplitColumnByExampleBuilder` class lets you generate a proper split program that works even when the cases are not trivial, as in the example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow = dprep.read_lines(path='../data/crime.txt')\n",
    "df = dflow.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Line'].iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see above, you can't split this particular file by the space character, as that would create too many columns.\n",
    "That's where `split_column_by_example` can be quite useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "builder = dflow.builders.split_column_by_example('Line', keep_delimiters=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "builder.preview()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A couple of things to note here. No examples were given, and yet DataPrep was able to generate a quite reasonable split program.\n",
    "We passed `keep_delimiters=True` so we can see all the data split into columns. In practice, though, delimiters are rarely useful, so let's exclude them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "builder.keep_delimiters = False\n",
    "builder.preview()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks pretty good already, except that one case number is split into two columns. Taking the first row as an example, we want to keep the case number as \"HY329907\" instead of \"HY\" and \"329907\" separately.\n",
    "If we request generation of suggested examples, we will get a list of examples that require input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "suggestions = builder.generate_suggested_examples()\n",
    "suggestions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "suggestions.iloc[0]['Line']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having retrieved the source value, we can now provide an example of the desired split.\n",
    "Notice that we chose not to split the date and time but rather keep them together in one column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "builder.add_example(example=(suggestions['Line'].iloc[0], ['10140490','HY329907','7/5/2015 23:50','050XX N NEWLAND AVE','820','THEFT']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "builder.preview()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see from the preview, some of the crime types (`Line_6`) do not show up as expected. Let's try to add one more example."
   ]
  },
  {
   "cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"builder.add_example(example=(df['Line'].iloc[1],['10139776','HY329265','7/5/2015 23:30','011XX W MORSE AVE','460','BATTERY']))\n",
|
||||
"builder.preview()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This looks just like what we need. Let's get a dataflow with splited columns and drop original column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = builder.to_dataflow()\n",
|
||||
"dflow = dflow.drop_columns(['Line'])\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we have successfully split the data into useful columns through examples. "
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.8"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Split column by example\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"DataPrep also offers you a way to easily split a column into multiple columns.\n",
|
||||
"The SplitColumnByExampleBuilder class lets you generate a proper split program that will work even when the cases are not trivial, like in example below."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = dprep.read_lines(path='../data/crime.txt')\n",
|
||||
"df = dflow.head(10)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"df['Line'].iloc[0]"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can see above, you can't split this particular file by space character as it will create too many columns.\n",
|
||||
"That's where `split_column_by_example` can be quite useful."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder = dflow.builders.split_column_by_example('Line', keep_delimiters=True)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.preview()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Couple things to take note of here. No examples were given, and yet DataPrep was able to generate quite reasonable split program. \n",
|
||||
"We have passed keep_delimiters=True so we can see all the data split into columns. In practice, though, delimiters are rarely useful, so let's exclude them."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.keep_delimiters = False\n",
|
||||
"builder.preview()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This looks pretty good already, except that one case number is split into 2 columns. Taking the first row as an example, we want to keep case number as \"HY329907\" instead of \"HY\" and \"329907\" seperately. \n",
|
||||
"If we request generation of suggested examples we will get a list of examples that require input."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"suggestions = builder.generate_suggested_examples()\n",
|
||||
"suggestions"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"suggestions.iloc[0]['Line']"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Having retrieved source value we can now provide an example of desired split.\n",
|
||||
"Notice that we chose not to split date and time but rather keep them together in one column."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.add_example(example=(suggestions['Line'].iloc[0], ['10140490','HY329907','7/5/2015 23:50','050XX N NEWLAND AVE','820','THEFT']))"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.preview()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As we can see from the preview, some of the crime types (`Line_6`) do not show up as expected. Let's try to add one more example. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"builder.add_example(example=(df['Line'].iloc[1],['10139776','HY329265','7/5/2015 23:30','011XX W MORSE AVE','460','BATTERY']))\n",
|
||||
"builder.preview()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This looks just like what we need. Let's get a dataflow with splited columns and drop original column."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow = builder.to_dataflow()\n",
|
||||
"dflow = dflow.drop_columns(['Line'])\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we have successfully split the data into useful columns through examples. "
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
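The notebook above splits crime.txt by example rather than by a single delimiter. Outside the SDK, the split program the builder synthesizes for these particular rows can be approximated with a plain regex; this is a simplified sketch of the idea (standard library only, not DataPrep's actual program synthesis), and the sample line is hypothetical, shaped like the rows quoted in the notebook:

```python
import re

# Hypothetical raw line, shaped like the crime.txt rows quoted above:
# fields are separated by runs of 2+ spaces, while single spaces occur
# inside the date/time and street-address fields.
line = "10140490  HY329907  7/5/2015 23:50  050XX N NEWLAND AVE  820  THEFT"

# A hand-written stand-in for the synthesized split program: split only on
# runs of two or more spaces, so "7/5/2015 23:50" stays in one column.
fields = re.split(r"\s{2,}", line.strip())

assert fields == ["10140490", "HY329907", "7/5/2015 23:50",
                  "050XX N NEWLAND AVE", "820", "THEFT"]
```

The builder's value is that it infers such a program from one or two examples instead of requiring you to write the pattern yourself.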
|
||||
@@ -1,217 +1,217 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Sampling and Subsetting\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once a Dataflow has been created, it is possible to act on only a subset of the records contained in it. This can help when working with very large datasets or when only a portion of the records is truly relevant."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Head\n",
|
||||
"\n",
|
||||
"The `head` method will take the number of records specified, run them through the transformations in the Dataflow, and then return the result as a Pandas dataframe."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"\n",
|
||||
"dflow = dprep.read_csv('../data/crime_duplicate_headers.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take\n",
|
||||
"\n",
|
||||
"The `take` method adds a step to the Dataflow that will keep the number of records specified (counting from the beginning) and drop the rest. Unlike `head`, which does not modify the Dataflow, all operations applied on a Dataflow on which `take` has been applied will affect only the records kept."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_top_five = dflow.take(5)\n",
|
||||
"dflow_top_five.to_pandas_dataframe()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Skip\n",
|
||||
"\n",
|
||||
"It is also possible to skip a certain number of records in a Dataflow, such that transformations are only applied after a specific point. Depending on the underlying data source, a Dataflow with a `skip` step might still have to scan through the data in order to skip past the records."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_skip_top_one = dflow_top_five.skip(1)\n",
|
||||
"dflow_skip_top_one.to_pandas_dataframe()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take Sample\n",
|
||||
"\n",
|
||||
"In addition to taking records from the top of the dataset, it's also possible to take a random sample of the dataset. This is done through the `take_sample(probability, seed=None)` method. This method will scan through all of the records available in the Dataflow and include them based on the probability specified. The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `take_sample` will receive different seeds."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_sampled = dflow.take_sample(0.1)\n",
|
||||
"dflow_sampled.to_pandas_dataframe()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`skip`, `take`, and `take_sample` can all be combined. With this, we can achieve behaviors like getting a random 10% sample fo the middle N records of a dataset."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"seed = 1\n",
|
||||
"dflow_nested_sample = dflow.skip(1).take(5).take_sample(0.5, seed)\n",
|
||||
"dflow_nested_sample.to_pandas_dataframe()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take Stratified Sample\n",
|
||||
"Besides sampling all by a probability, we also have stratified sampling, provided the strata and strata weights, the probability to sample each stratum with.\n",
|
||||
"This is done through the `take_stratified_sample(columns, fractions, seed=None)` method.\n",
|
||||
"For all records, we will group each record by the columns specified to stratify, and based on the stratum x weight information in `fractions`, include said record.\n",
|
||||
"\n",
|
||||
"Seed behavior is same as in `take_sample`.\n",
|
||||
"\n",
|
||||
"If a stratum is not specified or the record cannot be grouped by said stratum, we default the weight to sample by to 0 (it will not be included).\n",
|
||||
"\n",
|
||||
"The order of `fractions` must match the order of `columns`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fractions = {}\n",
|
||||
"fractions[('ASSAULT',)] = 0.5\n",
|
||||
"fractions[('BATTERY',)] = 0.2\n",
|
||||
"fractions[('ARSON',)] = 0.5\n",
|
||||
"fractions[('THEFT',)] = 1.0\n",
|
||||
"\n",
|
||||
"columns = ['Primary Type']\n",
|
||||
"\n",
|
||||
"single_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
||||
"single_strata_sample.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Stratified sampling on multiple columns is also supported."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fractions = {}\n",
|
||||
"fractions[('ASSAULT', '560')] = 0.5\n",
|
||||
"fractions[('BATTERY', '460')] = 0.2\n",
|
||||
"fractions[('ARSON', '1020')] = 0.5\n",
|
||||
"fractions[('THEFT', '820')] = 1.0\n",
|
||||
"\n",
|
||||
"columns = ['Primary Type', 'IUCR']\n",
|
||||
"\n",
|
||||
"multi_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
||||
"multi_strata_sample.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Caching\n",
|
||||
"It is usually a good idea to cache the sampled Dataflow for later uses.\n",
|
||||
"\n",
|
||||
"See [here](cache.ipynb) for more details about caching."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
{
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python36",
|
||||
"language": "python"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"language_info": {
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"name": "python",
|
||||
"file_extension": ".py",
|
||||
"nbconvert_exporter": "python",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"cells": [
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Sampling and Subsetting\n"
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once a Dataflow has been created, it is possible to act on only a subset of the records contained in it. This can help when working with very large datasets or when only a portion of the records is truly relevant."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Head\n",
|
||||
"\n",
|
||||
"The `head` method will take the number of records specified, run them through the transformations in the Dataflow, and then return the result as a Pandas dataframe."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"\n",
|
||||
"dflow = dprep.read_csv('../data/crime_duplicate_headers.csv')\n",
|
||||
"dflow.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take\n",
|
||||
"\n",
|
||||
"The `take` method adds a step to the Dataflow that will keep the number of records specified (counting from the beginning) and drop the rest. Unlike `head`, which does not modify the Dataflow, all operations applied on a Dataflow on which `take` has been applied will affect only the records kept."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_top_five = dflow.take(5)\n",
|
||||
"dflow_top_five.to_pandas_dataframe()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Skip\n",
|
||||
"\n",
|
||||
"It is also possible to skip a certain number of records in a Dataflow, such that transformations are only applied after a specific point. Depending on the underlying data source, a Dataflow with a `skip` step might still have to scan through the data in order to skip past the records."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_skip_top_one = dflow_top_five.skip(1)\n",
|
||||
"dflow_skip_top_one.to_pandas_dataframe()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take Sample\n",
|
||||
"\n",
|
||||
"In addition to taking records from the top of the dataset, it's also possible to take a random sample of the dataset. This is done through the `take_sample(probability, seed=None)` method. This method will scan through all of the records available in the Dataflow and include them based on the probability specified. The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `take_sample` will receive different seeds."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"dflow_sampled = dflow.take_sample(0.1)\n",
|
||||
"dflow_sampled.to_pandas_dataframe()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`skip`, `take`, and `take_sample` can all be combined. With this, we can achieve behaviors like getting a random 10% sample fo the middle N records of a dataset."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"seed = 1\n",
|
||||
"dflow_nested_sample = dflow.skip(1).take(5).take_sample(0.5, seed)\n",
|
||||
"dflow_nested_sample.to_pandas_dataframe()"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Take Stratified Sample\n",
|
||||
"Besides sampling all by a probability, we also have stratified sampling, provided the strata and strata weights, the probability to sample each stratum with.\n",
|
||||
"This is done through the `take_stratified_sample(columns, fractions, seed=None)` method.\n",
|
||||
"For all records, we will group each record by the columns specified to stratify, and based on the stratum x weight information in `fractions`, include said record.\n",
|
||||
"\n",
|
||||
"Seed behavior is same as in `take_sample`.\n",
|
||||
"\n",
|
||||
"If a stratum is not specified or the record cannot be grouped by said stratum, we default the weight to sample by to 0 (it will not be included).\n",
|
||||
"\n",
|
||||
"The order of `fractions` must match the order of `columns`."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"fractions = {}\n",
|
||||
"fractions[('ASSAULT',)] = 0.5\n",
|
||||
"fractions[('BATTERY',)] = 0.2\n",
|
||||
"fractions[('ARSON',)] = 0.5\n",
|
||||
"fractions[('THEFT',)] = 1.0\n",
|
||||
"\n",
|
||||
"columns = ['Primary Type']\n",
|
||||
"\n",
|
||||
"single_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
||||
"single_strata_sample.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Stratified sampling on multiple columns is also supported."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"source": [
|
||||
"fractions = {}\n",
|
||||
"fractions[('ASSAULT', '560')] = 0.5\n",
|
||||
"fractions[('BATTERY', '460')] = 0.2\n",
|
||||
"fractions[('ARSON', '1020')] = 0.5\n",
|
||||
"fractions[('THEFT', '820')] = 1.0\n",
|
||||
"\n",
|
||||
"columns = ['Primary Type', 'IUCR']\n",
|
||||
"\n",
|
||||
"multi_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
||||
"multi_strata_sample.head(5)"
|
||||
],
|
||||
"cell_type": "code"
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Caching\n",
|
||||
"It is usually a good idea to cache the sampled Dataflow for later uses.\n",
|
||||
"\n",
|
||||
"See [here](cache.ipynb) for more details about caching."
|
||||
],
|
||||
"cell_type": "markdown"
|
||||
}
|
||||
],
|
||||
"nbformat_minor": 2
|
||||
}
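The `skip`/`take`/`take_sample` semantics described in the sampling notebook can be mimicked on a plain Python iterable; this is a minimal streaming sketch (standard library only, not the SDK's execution engine), assuming per-record Bernoulli inclusion as described:

```python
import itertools
import random

def skip(records, n):
    # Drop the first n records, keep the rest (streaming, like Dataflow.skip).
    return itertools.islice(records, n, None)

def take(records, n):
    # Keep only the first n records (like Dataflow.take).
    return itertools.islice(records, n)

def take_sample(records, probability, seed=None):
    # Each record is included independently with the given probability;
    # a fixed seed makes the sample reproducible, as with Dataflow.take_sample.
    rng = random.Random(seed)
    return (r for r in records if rng.random() < probability)

records = range(100)
# Rough equivalent of dflow.skip(1).take(5).take_sample(0.5, seed=1):
sampled = list(take_sample(take(skip(iter(records), 1), 5), 0.5, seed=1))

assert set(sampled) <= {1, 2, 3, 4, 5}  # only records from the kept window
```

Because the sample is drawn record by record, the result size is itself random; only the inclusion probability (and, with a fixed seed, the exact result) is stable across runs.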
|
||||
File diff suppressed because it is too large
@@ -1,192 +1,192 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Working With File Streams\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In addition to loading and parsing tabular data (see [here](./data-ingestion.ipynb) for more details), Data Prep also supports a variety of operations on raw file streams. \n",
|
||||
"\n",
|
||||
"File streams are usually created by calling `Dataflow.get_files`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dprep.Dataflow.get_files(path='../data/*.csv')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The result of this operation is a Dataflow with a single column named \"Path\". This column contains values of type `StreamInfo`, each of which represents a different file matched by the search pattern specified when calling `get_files`. The string representation of a `StreamInfo` follows this pattern:\n",
|
||||
"\n",
|
||||
"StreamInfo(_Location_://_ResourceIdentifier_\\[_Arguments_\\])\n",
|
||||
"\n",
|
||||
"Location is the type of storage where the stream is located (e.g. Azure Blob, Local, or ADLS); ResouceIdentifier is the name of the file within that storage, such as a file path; and Arguments is a list of arguments required to load and read the file.\n",
|
||||
"\n",
|
||||
"On their own, `StreamInfo` objects are not particularly useful; however, you can use them as input to other functions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Retrieving File Names\n",
|
||||
"\n",
|
||||
"In the example above, we matched a set of CSV files by using a search pattern and got back a column with several `StreamInfo` objects, each representing a different file. Now, we will extract the file path and name for each of these values into a new string column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow = dflow.add_column(expression=dprep.get_stream_name(dflow['Path']),\n",
|
||||
" new_column_name='FilePath',\n",
|
||||
" prior_column='Path')\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `get_stream_name` function will return the full name of the file referenced by a `StreamInfo`. In the case of a local file, this will be an absolute path. From here, you can use the `derive_column_by_example` method to extract just the file name."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"\n",
|
||||
"first_file_path = dflow.head(1)['FilePath'][0]\n",
|
||||
"first_file_name = os.path.basename(first_file_path)\n",
|
||||
"dflow = dflow.derive_column_by_example(new_column_name='FileName',\n",
|
||||
" source_columns=['FilePath'],\n",
|
||||
" example_data=(first_file_path, first_file_name))\n",
|
||||
"dflow = dflow.drop_columns(['FilePath'])\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Writing Streams\n",
|
||||
"\n",
|
||||
"Whenever you have a column containing `StreamInfo` objects, it's possible to write these out to any of the locations Data Prep supports. You can do this by calling `Dataflow.write_streams`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.write_streams(streams_column='Path', base_path=dprep.LocalFileOutput('./test_out/')).run_local()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `base_path` parameter specifies the location the files will be written to. By default, the name of the file will be the resource identifier of the stream with any invalid characters replaced by `_`. In the case of streams referencing local files, this would be the full path of the original file. You can also specify the desired file names by referencing a column containing them:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow.write_streams(streams_column='Path',\n",
|
||||
" base_path=dprep.LocalFileOutput('./test_out/'),\n",
|
||||
" file_names_column='FileName').run_local()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Using this functionality, you can transfer files from any source to any destination supported by Data Prep. In addition, since the streams are just values in the Dataflow, you can use all of the functionality available.\n",
|
||||
"\n",
|
||||
"Here, for example, we will write out only the files that start with the prefix \"crime-\". The resulting file names will have the prefix stripped and will be written to a folder named \"crime\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prefix = 'crime-'\n",
|
||||
"dflow = dflow.filter(dflow['FileName'].starts_with(prefix))\n",
|
||||
"dflow = dflow.add_column(expression=dflow['FileName'].substring(len(prefix)),\n",
|
||||
" new_column_name='CleanName',\n",
|
||||
" prior_column='FileName')\n",
|
||||
"dflow.write_streams(streams_column='Path',\n",
|
||||
" base_path=dprep.LocalFileOutput('./test_out/crime/'),\n",
|
||||
" file_names_column='CleanName').run_local()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
},
|
||||
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
@@ -1,183 +1,183 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is possible to write out the data at any point in a Dataflow. These writes are added as steps to the resulting Dataflow and will be executed every time the Dataflow is executed. Since there is no limit on the number of write steps in a pipeline, this makes it easy to write out intermediate results for troubleshooting or for other pipelines to pick up.\n",
"\n",
"It is important to note that the execution of each write results in a full pull of the data in the Dataflow. For example, a Dataflow with three write steps will read and process every record in the dataset three times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing to Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data can be written to files in any of our supported locations (Local File System, Azure Blob Storage, and Azure Data Lake Storage). To parallelize the write, the data is written to multiple partition files. A sentinel file named SUCCESS is also output once the write has completed, which makes it possible to identify when an intermediate write has completed without having to wait for the whole pipeline to complete.\n",
"\n",
"> When running a Dataflow in Spark, attempting to write to an existing folder will fail. Ensure the folder is empty, or use a different target location for each execution.\n",
"\n",
"The following file formats are currently supported:\n",
"- Delimited Files (CSV, TSV, etc.)\n",
"- Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll start by loading data into a Dataflow, which will be reused with different formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.auto_read_file('../data/crime.txt')\n",
"dflow = dflow.to_number('Column2')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delimited Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we create a Dataflow with a write step.\n",
"\n",
"This operation is lazy: the write is executed only when we invoke `run_local` (or any other operation that forces execution, such as `to_pandas_dataframe`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out/'))\n",
"\n",
"dflow_write.run_local()\n",
"\n",
"dflow_written_files = dprep.read_csv('./test_out/part-*')\n",
"dflow_written_files.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data we wrote out contains several errors in the numeric columns, caused by numbers we were unable to parse. When written out to CSV, these error values are replaced with the string \"ERROR\" by default. We can parameterize this as part of our write call. In the same vein, it is also possible to set the string used to represent null values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write_errors = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out/'),\n",
"                                        error='BadData',\n",
"                                        na='NA')\n",
"dflow_write_errors.run_local()\n",
"dflow_written = dprep.read_csv('./test_out/part-*')\n",
"dflow_written.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to `write_to_csv`, `write_to_parquet` returns a new Dataflow with a Write Parquet step that has not been executed yet.\n",
"\n",
"We then run the Dataflow with `run_local`, which executes the write operation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write_parquet = dflow.write_to_parquet(directory_path=dprep.LocalFileOutput('./test_parquet_out/'),\n",
"                                             error='MiscreantData')\n",
"\n",
"dflow_write_parquet.run_local()\n",
"\n",
"dflow_written_parquet = dprep.read_parquet_file('./test_parquet_out/part-*')\n",
"dflow_written_parquet.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,441 +1,441 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Getting started with Azure ML Data Prep SDK\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"Wonder how you can make the most of the Azure ML Data Prep SDK? In this \"Getting Started\" guide, we'll demonstrate how to do your normal data wrangling with this SDK and showcase a few highlights that make this SDK shine. Using a sample of this [Kaggle crime dataset](https://www.kaggle.com/currie32/crimes-in-chicago/home) as an example, we'll cover how to:\n",
|
||||
"\n",
|
||||
"* [Read in data](#Read)\n",
|
||||
"* [Profile your data](#Profile)\n",
|
||||
"* [Append rows](#Append)\n",
|
||||
"* [Apply common data science transforms](#Data-science-transforms)\n",
|
||||
" * [Summarize](#Summarize)\n",
|
||||
" * [Join](#Join)\n",
|
||||
" * [Filter](#Filter)\n",
|
||||
" * [Replace](#Replace)\n",
|
||||
"* [Consume your cleaned dataset](#Consume)\n",
|
||||
"* [Explore advanced features](#Explore)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from IPython.display import display\n",
|
||||
"from os import path\n",
|
||||
"from tempfile import mkdtemp\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"import azureml.dataprep as dprep\n",
|
||||
"\n",
|
||||
"# Paths for datasets\n",
|
||||
"file_crime_dirty = '../../data/crime-dirty.csv'\n",
|
||||
"file_crime_spring = '../../data/crime-spring.csv'\n",
|
||||
"file_crime_winter = '../../data/crime-winter.csv'\n",
|
||||
"file_aldermen = '../../data/chicago-aldermen-2015.csv'\n",
|
||||
"\n",
|
||||
"# Seed\n",
|
||||
"RAND_SEED = 7251"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Read\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Read in data\n",
|
||||
"\n",
|
||||
"Azure ML Data Prep supports many different file reading formats (i.e. CSV, Excel, Parquet) and the ability to infer column types automatically. To see how powerful the `auto_read_file` capability is, let's take a peek at the `dirty-crime.csv`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dprep.read_csv(path=file_crime_dirty).head(7)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A common occurrence in many datasets is to have a column of values with commas; in our case, the last column represents location in the form of longitude-latitude pair. The default CSV reader interprets this comma as a delimiter and thus splits the data into two columns. Furthermore, it incorrectly reads in the header as the column name. Normally, we would need to `skip` the header and specify the delimiter as `|`, but our `auto_read_file` eliminates that work:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crime_dirty = dprep.auto_read_file(path=file_crime_dirty)\n",
|
||||
"\n",
|
||||
"crime_dirty.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"__Advanced features:__ if you'd like to specify the file type and adjust how you want to read files in, you can see the list of our specialized file readers and how to use them [here](../../how-to-guides/data-ingestion.ipynb)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Profile\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Profile your data\n",
|
||||
"\n",
|
||||
"Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics. Notice that our auto file reader automatically guessed the column type:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crime_dirty.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Append\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Append rows\n",
|
||||
"\n",
|
||||
"What if your data is split across multiple files? We support the ability to append multiple datasets column-wise and row-wise. Here, we demonstrate how you can coalesce datasets row-wise:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Datasets with the same schema as crime_dirty\n",
|
||||
"crime_winter = dprep.auto_read_file(path=file_crime_winter)\n",
|
||||
"crime_spring = dprep.auto_read_file(path=file_crime_spring)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crime = (crime_dirty.append_rows(dataflows=[crime_winter, crime_spring]))\n",
|
||||
"\n",
|
||||
"crime.take_sample(probability=0.25, seed=RAND_SEED).head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"__Advanced features:__ you can learn how to append column-wise and how to deal with appending data with different schemas [here](../../how-to-guides/append-columns-and-rows.ipynb)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Data-science-transforms\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Apply common data science transforms\n",
|
||||
"\n",
|
||||
"Azure ML Data Prep supports almost all common data science transforms found in other industry-standard data science libraries. Here, we'll explore the ability to `summarize`, `join`, `filter`, and `replace`. \n",
|
||||
"\n",
|
||||
"__Advanced features:__\n",
|
||||
"* We also provide \"smart\" transforms not found in pandas that use machine learning to [derive new columns](../../how-to-guides/derive-column-by-example.ipynb), [split columns](../../how-to-guides/split-column-by-example.ipynb), and [fuzzy grouping](../../how-to-guides/fuzzy-group.ipynb).\n",
|
||||
"* Finally, we also help featurize your dataset to prepare it for machine learning; learn more about our featurizers like [one-hot encoder](../../how-to-guides/one-hot-encoder.ipynb), [label encoder](../../how-to-guides/label-encoder.ipynb), [min-max scaler](../../how-to-guides/min-max-scaler.ipynb), and [random (train-test) split](../../how-to-guides/random-split.ipynb).\n",
|
||||
"* Our complete list of example Notebooks for transforms can be found in our [How-to Guides](../../how-to-guides)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Summarize\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Summarize\n",
|
||||
"\n",
|
||||
"Let's see which wards had the most crimes in our sample dataset:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crime_summary = (crime\n",
|
||||
" .summarize(\n",
|
||||
" summary_columns=[\n",
|
||||
" dprep.SummaryColumnsValue(\n",
|
||||
" column_id='ID', \n",
|
||||
" summary_column_name='total_ward_crimes', \n",
|
||||
" summary_function=dprep.SummaryFunction.COUNT\n",
|
||||
" )\n",
|
||||
" ],\n",
|
||||
" group_by_columns=['Ward']\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"(crime_summary\n",
|
||||
" .sort(sort_order=[('total_ward_crimes', True)])\n",
|
||||
" .head(5)\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Join\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Join\n",
|
||||
"\n",
|
||||
"Let's annotate each observation with more information about the ward where the crime occurred. Let's do so by joining `crime` with a dataset which lists the current aldermen for each ward:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"aldermen = dprep.auto_read_file(path=file_aldermen)\n",
|
||||
"\n",
|
||||
"aldermen.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"crime.join(\n",
|
||||
" left_dataflow=crime,\n",
|
||||
" right_dataflow=aldermen,\n",
|
||||
" join_key_pairs=[\n",
|
||||
" ('Ward', 'Ward')\n",
|
||||
" ]\n",
|
||||
").head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"__Advanced features:__ [Learn more](../../how-to-guides/join.ipynb) about how you can do all variants of `join`, like inner-, left-, right-, anti-, and semi-joins."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Filter\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filter\n",
|
||||
"\n",
|
||||
"Let's look at theft crimes:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"theft = crime.filter(crime['Primary Type'] == 'THEFT')\n",
|
||||
"\n",
|
||||
"theft.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Replace\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Replace\n",
|
||||
"\n",
|
||||
"Notice that our `theft` dataset has empty strings in column `Location`. Let's replace those with a missing value:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"theft_replaced = (theft\n",
|
||||
" .replace_na(\n",
|
||||
" columns=['Location'], \n",
|
||||
" use_empty_string_as_na=True\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"theft_replaced.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"__Advanced features:__ [Learn more](../../how-to-guides/replace-fill-error.ipynb) about more advanced `replace` and `fill` capabilities."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"Consume\"></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Consume your cleaned dataset\n",
|
||||
"\n",
|
||||
"Azure ML Data Prep allows you to \"choose your own adventure\" once you're done wrangling. You can:\n",
|
||||
"\n",
|
||||
"1. Write to a pandas dataframe\n",
|
||||
"2. Execute on Spark\n",
|
||||
"3. Consume directly in Azure Machine Learning models\n",
|
||||
"\n",
|
||||
"In this quickstart guide, we'll show how you can export to a pandas dataframe.\n",
|
||||
"\n",
|
||||
"__Advanced features:__ \n",
|
||||
"* One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out.\n",
|
||||
"* You can directly consume your new DataFlow in model builders like Azure Machine Learning's [automated machine learning](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/dataprep/auto-ml-dataprep.ipynb)."
|
||||
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theft_replaced.to_pandas_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Explore\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore advanced features\n",
"\n",
"Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:\n",
"\n",
"* [Cache your Dataflow to speed up your iterations](../../how-to-guides/cache.ipynb)\n",
"* [Add your custom Python transforms](../../how-to-guides/custom-python-transforms.ipynb)\n",
"* [Impute missing values](../../how-to-guides/impute-missing-values.ipynb)\n",
"* [Sample your data](../../how-to-guides/subsetting-sampling.ipynb)\n",
"* [Reference and link between Dataflows](../../how-to-guides/join.ipynb)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"name": "python36",
"language": "python"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"authors": [
{
"name": "sihhu"
}
],
"language_info": {
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"name": "python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.6.2"
}
},
"nbformat": 4,
"cells": [
{
"metadata": {},
"source": [
"# Getting started with Azure ML Data Prep SDK\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"\n",
"Wondering how you can make the most of the Azure ML Data Prep SDK? In this \"Getting Started\" guide, we'll demonstrate how to handle everyday data wrangling with the SDK and showcase a few highlights that make it shine. Using a sample of this [Kaggle crime dataset](https://www.kaggle.com/currie32/crimes-in-chicago/home) as an example, we'll cover how to:\n",
"\n",
"* [Read in data](#Read)\n",
"* [Profile your data](#Profile)\n",
"* [Append rows](#Append)\n",
"* [Apply common data science transforms](#Data-science-transforms)\n",
"  * [Summarize](#Summarize)\n",
"  * [Join](#Join)\n",
"  * [Filter](#Filter)\n",
"  * [Replace](#Replace)\n",
"* [Consume your cleaned dataset](#Consume)\n",
"* [Explore advanced features](#Explore)\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from IPython.display import display\n",
"from os import path\n",
"from tempfile import mkdtemp\n",
"\n",
"import pandas as pd\n",
"import azureml.dataprep as dprep\n",
"\n",
"# Paths for datasets\n",
"file_crime_dirty = '../../data/crime-dirty.csv'\n",
"file_crime_spring = '../../data/crime-spring.csv'\n",
"file_crime_winter = '../../data/crime-winter.csv'\n",
"file_aldermen = '../../data/chicago-aldermen-2015.csv'\n",
"\n",
"# Seed\n",
"RAND_SEED = 7251"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"<a id=\"Read\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Read in data\n",
"\n",
"Azure ML Data Prep supports many different file formats (e.g. CSV, Excel, Parquet) and can infer column types automatically. To see how powerful the `auto_read_file` capability is, let's take a peek at `crime-dirty.csv`:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dprep.read_csv(path=file_crime_dirty).head(7)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Many datasets contain a column whose values include commas; in our case, the last column represents location as a longitude-latitude pair. The default CSV reader interprets these commas as delimiters and splits the data into two columns. Furthermore, it incorrectly reads the header line in as column names. Normally, we would need to skip the header and specify `|` as the delimiter, but `auto_read_file` eliminates that work:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"crime_dirty = dprep.auto_read_file(path=file_crime_dirty)\n",
"\n",
"crime_dirty.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"__Advanced features:__ if you'd like to specify the file type and adjust how you want to read files in, you can see the list of our specialized file readers and how to use them [here](../../how-to-guides/data-ingestion.ipynb)."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"<a id=\"Profile\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Profile your data\n",
"\n",
"Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that give us a glimpse into column types and summary statistics. Notice that the auto file reader guessed the column types automatically:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"crime_dirty.get_profile()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"<a id=\"Append\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Append rows\n",
"\n",
"What if your data is split across multiple files? We support the ability to append multiple datasets column-wise and row-wise. Here, we demonstrate how you can coalesce datasets row-wise:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Datasets with the same schema as crime_dirty\n",
"crime_winter = dprep.auto_read_file(path=file_crime_winter)\n",
"crime_spring = dprep.auto_read_file(path=file_crime_spring)"
],
"cell_type": "code"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"crime = crime_dirty.append_rows(dataflows=[crime_winter, crime_spring])\n",
"\n",
"crime.take_sample(probability=0.25, seed=RAND_SEED).head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"__Advanced features:__ you can learn how to append column-wise and how to deal with appending data with different schemas [here](../../how-to-guides/append-columns-and-rows.ipynb)."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"<a id=\"Data-science-transforms\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Apply common data science transforms\n",
"\n",
"Azure ML Data Prep supports most of the common transforms found in industry-standard data science libraries. Here, we'll explore the ability to `summarize`, `join`, `filter`, and `replace`.\n",
"\n",
"__Advanced features:__\n",
"* We also provide \"smart\" transforms not found in pandas that use machine learning to [derive new columns](../../how-to-guides/derive-column-by-example.ipynb), [split columns](../../how-to-guides/split-column-by-example.ipynb), and [group similar values](../../how-to-guides/fuzzy-group.ipynb).\n",
"* We can also help featurize your dataset to prepare it for machine learning; learn more about our featurizers like the [one-hot encoder](../../how-to-guides/one-hot-encoder.ipynb), [label encoder](../../how-to-guides/label-encoder.ipynb), [min-max scaler](../../how-to-guides/min-max-scaler.ipynb), and [random (train-test) split](../../how-to-guides/random-split.ipynb).\n",
"* Our complete list of example Notebooks for transforms can be found in our [How-to Guides](../../how-to-guides)."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"<a id=\"Summarize\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"### Summarize\n",
"\n",
"Let's see which wards had the most crimes in our sample dataset:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"crime_summary = (crime\n",
"    .summarize(\n",
"        summary_columns=[\n",
"            dprep.SummaryColumnsValue(\n",
"                column_id='ID',\n",
"                summary_column_name='total_ward_crimes',\n",
"                summary_function=dprep.SummaryFunction.COUNT\n",
"            )\n",
"        ],\n",
"        group_by_columns=['Ward']\n",
"    )\n",
")\n",
"\n",
"(crime_summary\n",
"    .sort(sort_order=[('total_ward_crimes', True)])\n",
"    .head(5)\n",
")"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"<a id=\"Join\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"### Join\n",
"\n",
"Let's annotate each observation with more information about the ward where the crime occurred by joining `crime` with a dataset that lists the current alderman for each ward:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"aldermen = dprep.auto_read_file(path=file_aldermen)\n",
"\n",
"aldermen.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dprep.Dataflow.join(\n",
"    left_dataflow=crime,\n",
"    right_dataflow=aldermen,\n",
"    join_key_pairs=[\n",
"        ('Ward', 'Ward')\n",
"    ]\n",
").head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"__Advanced features:__ [Learn more](../../how-to-guides/join.ipynb) about how you can do all variants of `join`, like inner-, left-, right-, anti-, and semi-joins."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"<a id=\"Filter\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"### Filter\n",
"\n",
"Let's look at theft crimes:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"theft = crime.filter(crime['Primary Type'] == 'THEFT')\n",
"\n",
"theft.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"<a id=\"Replace\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"### Replace\n",
"\n",
"Notice that our `theft` dataset has empty strings in column `Location`. Let's replace those with a missing value:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"theft_replaced = (theft\n",
"    .replace_na(\n",
"        columns=['Location'],\n",
"        use_empty_string_as_na=True\n",
"    )\n",
")\n",
"\n",
"theft_replaced.head(5)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"__Advanced features:__ [Learn more](../../how-to-guides/replace-fill-error.ipynb) about advanced `replace` and `fill` capabilities."
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"<a id=\"Consume\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Consume your cleaned dataset\n",
"\n",
"Azure ML Data Prep allows you to \"choose your own adventure\" once you're done wrangling. You can:\n",
"\n",
"1. Write to a pandas dataframe\n",
"2. Execute on Spark\n",
"3. Consume directly in Azure Machine Learning models\n",
"\n",
"In this quickstart guide, we'll show how you can export to a pandas dataframe.\n",
"\n",
"__Advanced features:__ \n",
"* A key benefit of Azure ML Data Prep is that you only need to write your code once and then choose whether to scale up or out.\n",
"* You can directly consume your new Dataflow in model builders like Azure Machine Learning's [automated machine learning](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/dataprep/auto-ml-dataprep.ipynb)."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"theft_replaced.to_pandas_dataframe()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"<a id=\"Explore\"></a>"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Explore advanced features\n",
"\n",
"Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:\n",
"\n",
"* [Cache your Dataflow to speed up your iterations](../../how-to-guides/cache.ipynb)\n",
"* [Add your custom Python transforms](../../how-to-guides/custom-python-transforms.ipynb)\n",
"* [Impute missing values](../../how-to-guides/impute-missing-values.ipynb)\n",
"* [Sample your data](../../how-to-guides/subsetting-sampling.ipynb)\n",
"* [Reference and link between Dataflows](../../how-to-guides/join.ipynb)"
],
"cell_type": "markdown"
}
],
"nbformat_minor": 2
}