Dataset API change notice
Why are Dataset API changes essential?
The existing Dataset class supports only tabular data. To support binary data and a wider range of machine learning scenarios, including deep learning, we are introducing typed Datasets. Datasets are categorized by how users consume them in training. The Dataset types are:
- TabularDataset: Represents data in a tabular format by parsing the provided file or list of files. TabularDataset can be created from csv, tsv, parquet files, SQL query results etc. For the complete list, please visit our documentation. It provides you with the ability to materialize the data into a pandas DataFrame.
- FileDataset: References single or multiple files in your datastores or public urls. The files can be of any format. FileDataset provides you with the ability to download or mount the files to your compute.
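As a rough illustration of the two types, the sketch below creates one of each from datastore paths. It assumes the Azure ML Python SDK (`azureml-core`) is installed and that `datastore` is an existing datastore object; the function name and the paths `weather/*.csv` and `images/**` are hypothetical placeholders, not names from this notice.

```python
def create_typed_datasets(datastore):
    """Create one TabularDataset and one FileDataset from datastore paths.

    `datastore` is assumed to be an existing Azure ML datastore object;
    the relative paths below are hypothetical placeholders.
    """
    # Imported lazily so the snippet can be loaded without the SDK installed.
    from azureml.core import Dataset

    # Tabular data: the referenced files are parsed into rows and columns.
    tabular = Dataset.Tabular.from_delimited_files(
        path=[(datastore, "weather/*.csv")]
    )

    # Arbitrary files: the dataset only references them and does not parse them.
    files = Dataset.File.from_files(path=[(datastore, "images/**")])

    return tabular, files
```

The TabularDataset can then be materialized with `tabular.to_pandas_dataframe()`, while the FileDataset is consumed with `files.download()` or `files.mount()`.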
To transition from the current Dataset design to typed Datasets, we will deprecate the following methods over time.
Which methods on Dataset class will be deprecated in upcoming releases?
| Methods to be deprecated | Replacement in the new version |
|---|---|
| Dataset.get() | Dataset.get_by_name() |
| Dataset.from_pandas_dataframe() | Creating a Dataset from in-memory DataFrame or local files will cause errors in training on remote compute. Therefore, the new Dataset design will only support creating Datasets from paths in datastores or public web urls. If you are using pandas, you can write the DataFrame into a parquet file, upload it to the cloud, and create a TabularDataset referencing the parquet file using Dataset.Tabular.from_parquet_files() |
| Dataset.from_delimited_files() | Dataset.Tabular.from_delimited_files() |
| Dataset.auto_read_files() | auto_read_files does not always produce results that match users' expectations. To avoid confusion, this method is not included in TabularDataset for now. Please use Dataset.Tabular.from_parquet_files() or Dataset.Tabular.from_delimited_files(), depending on your file format. |
| Dataset.from_parquet_files() | Dataset.Tabular.from_parquet_files() |
| Dataset.from_sql_query() | Dataset.Tabular.from_sql_query() |
| Dataset.from_excel_files() | We will support creating a TabularDataset from Excel files in a future release. |
| Dataset.from_json_files() | Dataset.Tabular.from_json_lines_files() |
| Dataset.to_pandas_dataframe() | TabularDataset.to_pandas_dataframe() |
| Dataset.to_spark_dataframe() | TabularDataset.to_spark_dataframe() |
| Dataset.head(3) | TabularDataset.take(3).to_pandas_dataframe() |
| Dataset.sample() | TabularDataset.take_sample() |
| Dataset.from_binary_files() | Dataset.File.from_files() |
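The replacement for Dataset.from_pandas_dataframe() described in the table involves three steps: write the DataFrame to a parquet file, upload it, and reference the uploaded file. A minimal sketch follows. It assumes `df` is a pandas DataFrame (with a parquet engine such as pyarrow available) and that `datastore` exposes the SDK's `upload()` method, as blob and file share datastores do; the function name, the `migrated-data` target folder, and the `data.parquet` file name are hypothetical.

```python
import os
import tempfile


def dataframe_to_tabular_dataset(df, datastore, target_path="migrated-data"):
    """Write `df` to parquet, upload it, and return a TabularDataset for it.

    `target_path` and the file name `data.parquet` are illustrative choices.
    """
    # Imported lazily so the snippet can be loaded without the SDK installed.
    from azureml.core import Dataset

    with tempfile.TemporaryDirectory() as local_dir:
        # Step 1: write the in-memory DataFrame to a local parquet file.
        local_file = os.path.join(local_dir, "data.parquet")
        df.to_parquet(local_file, index=False)

        # Step 2: upload the folder contents to the datastore.
        datastore.upload(src_dir=local_dir, target_path=target_path, overwrite=True)

    # Step 3: create a TabularDataset that references the uploaded file.
    return Dataset.Tabular.from_parquet_files(
        path=[(datastore, f"{target_path}/data.parquet")]
    )
```

Because the resulting dataset points at a datastore path instead of local memory, it can be read from remote compute as well as locally.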
Why should I use the new Dataset API if I'm only dealing with tabular data?
The current Dataset will be kept around for backward compatibility, but we strongly encourage you to move to TabularDataset for the new capabilities listed below:
- You are able to version and track the new typed Datasets. Learn How
- You are able to use TabularDatasets as automated ML input. Learn How
- You are able to use the new typed Datasets as ScriptRun, Estimator, HyperDrive input. Learn How
- You are able to use the new typed Datasets in Azure Machine Learning Pipelines. Learn How
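Two of the capabilities above, versioning and passing a Dataset as a run input, can be sketched as follows. This is an illustration under assumptions, not the only way to do it: `dataset` is assumed to be an already created TabularDataset or FileDataset, `workspace` an existing Workspace, and the helper names and the input name `training_data` are hypothetical.

```python
def register_new_version(dataset, workspace, name):
    """Register `dataset` under `name`; if the name exists, add a new version."""
    return dataset.register(
        workspace=workspace,
        name=name,
        create_new_version=True,
    )


def as_training_input(dataset, input_name="training_data"):
    """Wrap `dataset` so a run can look it up by `input_name`."""
    # The returned object is what gets passed to the run configuration.
    return dataset.as_named_input(input_name)
```

The object returned by `as_training_input` is intended to be handed to the run or pipeline step that consumes the data; how it is wired in depends on the run type.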
How to migrate registered Datasets to new typed Datasets?
We handled the migration for you. All legacy datasets are migrated to new typed Datasets automatically. To use registered datasets, simply call Dataset.get_by_name.
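For example, a registered dataset can be retrieved by name and previewed with the `take(...).to_pandas_dataframe()` replacement for `head()` listed in the table above. The sketch assumes the registered dataset is a TabularDataset and that `workspace` is an existing Workspace object; the function name is hypothetical.

```python
def preview_registered_dataset(workspace, name, rows=3):
    """Fetch a registered dataset by name and return its first `rows` rows.

    Assumes the dataset registered under `name` is a TabularDataset.
    """
    # Imported lazily so the snippet can be loaded without the SDK installed.
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(workspace, name=name)

    # take() limits the dataset to the first rows before materializing it.
    return dataset.take(rows).to_pandas_dataframe()
```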
How to provide feedback?
If you have any feedback about our product, or if a capability that is essential for you to use the new Dataset API is missing, please email us at AskAzureMLData@microsoft.com.