Azure Machine Learning Datasets (preview)
Azure Machine Learning Datasets (preview) make it easier to access and work with your data. Datasets manage data in scenarios such as model training and pipeline creation. With the Azure Machine Learning SDK, you can access underlying storage, explore and prepare data, manage the life cycle of different Dataset definitions, and compare the Datasets used in training with those used in production.
Create and Register Datasets
It's easy to create Datasets from either local files or Azure Datastores.
from azureml.core.workspace import Workspace
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset
datastore_name = 'your datastore name'
# get existing workspace
workspace = Workspace.from_config()
# get Datastore from the workspace
dstore = Datastore.get(workspace, datastore_name)
# create an in-memory Dataset on your local machine
dataset = Dataset.from_delimited_files(dstore.path('data/src/crime.csv'))
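If your data is in local files rather than a Datastore, you can point Dataset.from_delimited_files at a local path instead. A minimal sketch, assuming a hypothetical local CSV at ./data/crime.csv:
# create an in-memory Dataset from a local delimited file (hypothetical path)
local_dataset = Dataset.from_delimited_files(path='./data/crime.csv')
# materialize a preview as a pandas DataFrame
local_dataset.to_pandas_dataframe()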
To consume Datasets in Azure Machine Learning service scenarios such as automated machine learning, model training, and pipeline creation, you need to register them with your workspace. Registering also lets you share and reuse Datasets within your organization.
dataset = dataset.register(workspace=workspace,
                           name='dataset_crime',
                           description='Training data')
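After registration, the Dataset can be retrieved by name from any script or notebook with access to the workspace, which is how it is shared and reused. A minimal sketch using the same Dataset.get call that appears in the training script later in this article:
# retrieve the registered Dataset by name
registered_dataset = Dataset.get(workspace=workspace, name='dataset_crime')
# list all Datasets registered with the workspace (assumes Dataset.list is available in this preview SDK)
Dataset.list(workspace)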
Sampling
Sampling can be particularly useful with Datasets that are too large to analyze efficiently in full. It lets data scientists work with a manageable amount of data to build and train machine learning models. At this time, the sample() method of the Dataset class supports Top N, Simple Random, and Stratified sampling strategies.
After sampling, you can convert your sampled Dataset to a pandas DataFrame for training. Because the native sample() method of the Dataset class loads the sampled data on the fly, the full data is never loaded into memory.
Top N sample
For Top N sampling, the first n records of your Dataset are your sample. This is helpful if you are just trying to get an idea of what your data records look like or to see what fields are in your data.
top_n_sample_dataset = dataset.sample('top_n', {'n': 5})
top_n_sample_dataset.to_pandas_dataframe()
Simple random sample
In Simple Random sampling, every member of the data population has an equal chance of being selected as part of the sample. In the simple_random sample strategy, records from your Dataset are selected based on the specified probability, and a modified Dataset is returned. The seed parameter is optional.
seed = 123  # optional; set a seed for reproducible sampling
simple_random_sample_dataset = dataset.sample('simple_random', {'probability': 0.3, 'seed': seed})
simple_random_sample_dataset.to_pandas_dataframe()
Stratified sample
Stratified samples ensure that certain groups of a population are represented in the sample. In the stratified sample strategy, the population is divided into strata, or subgroups, based on similarities, and records are randomly selected from each stratum according to the weights given by the fractions parameter.
In the following example, each record is grouped by the specified columns and included in the sample according to its stratum's weight in fractions. If a stratum is not specified, or a record cannot be grouped, its default sampling weight is 0.
# take 50% of records with `Primary Type` as `THEFT` and 20% of records with `Primary Type` as `DECEPTIVE PRACTICE` into sample Dataset
fractions = {}
fractions[('THEFT',)] = 0.5
fractions[('DECEPTIVE PRACTICE',)] = 0.2
sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions, 'seed': seed})
sample_dataset.to_pandas_dataframe()
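To sanity-check the strata proportions, you can inspect the sampled data with ordinary pandas calls once it has been converted to a DataFrame, for example:
# count how many sampled records fall into each stratum
sample_df = sample_dataset.to_pandas_dataframe()
print(sample_df['Primary Type'].value_counts())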
Explore with summary statistics
Detect anomalies, missing values, or error counts with the get_profile() method. This function gets the profile and summary statistics of your data, which in turn helps determine the necessary data preparation operations to apply.
# get the pre-calculated profile
# if no pre-calculated profile is available, or it is out of date, this call generates a new profile of the Dataset
dataset.get_profile()
| Column | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent Missing | Error Count | Empty Count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Mean | Standard Deviation | Variance | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | FieldType.INTEGER | 1.04986e+07 | 1.05351e+07 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 1.04986e+07 | 1.04992e+07 | 1.04986e+07 | 1.05166e+07 | 1.05209e+07 | 1.05259e+07 | 1.05351e+07 | 1.05351e+07 | 1.05351e+07 | 1.05195e+07 | 12302.7 | 1.51358e+08 | -0.495701 | -1.02814 |
| Case Number | FieldType.STRING | HZ239907 | HZ278872 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Date | FieldType.DATE | 2016-04-04 23:56:00+00:00 | 2016-04-15 17:00:00+00:00 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Block | FieldType.STRING | 004XX S KILBOURN AVE | 113XX S PRAIRIE AVE | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| IUCR | FieldType.INTEGER | 810 | 1154 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 810 | 850 | 810 | 890 | 1136 | 1153 | 1154 | 1154 | 1154 | 1058.5 | 137.285 | 18847.2 | -0.785501 | -1.3543 |
| Primary Type | FieldType.STRING | DECEPTIVE PRACTICE | THEFT | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Description | FieldType.STRING | BOGUS CHECK | OVER $500 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Location Description | FieldType.STRING | SCHOOL, PUBLIC, BUILDING | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 1.0 | |||||||||||||||
| Arrest | FieldType.BOOLEAN | False | False | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Domestic | FieldType.BOOLEAN | False | False | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Beat | FieldType.INTEGER | 531 | 2433 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 531 | 531 | 531 | 614 | 1318.5 | 1911 | 2433 | 2433 | 2433 | 1371.1 | 692.094 | 478994 | 0.105418 | -1.60684 |
| District | FieldType.INTEGER | 5 | 24 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 5 | 5 | 5 | 6 | 13 | 19 | 24 | 24 | 24 | 13.5 | 6.94822 | 48.2778 | 0.0930109 | -1.62325 |
| Ward | FieldType.INTEGER | 1 | 48 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 1 | 5 | 1 | 9 | 22.5 | 40 | 48 | 48 | 48 | 24.5 | 16.2635 | 264.5 | 0.173723 | -1.51271 |
| Community Area | FieldType.INTEGER | 4 | 77 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 4 | 8.5 | 4 | 24 | 37.5 | 71 | 77 | 77 | 77 | 41.2 | 26.6366 | 709.511 | 0.112157 | -1.73379 |
| FBI Code | FieldType.INTEGER | 6 | 11 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6 | 6 | 6 | 6 | 11 | 11 | 11 | 11 | 11 | 9.4 | 2.36643 | 5.6 | -0.702685 | -1.59582 |
| X Coordinate | FieldType.INTEGER | 1.16309e+06 | 1.18336e+06 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 1.16309e+06 | 1.16309e+06 | 1.16309e+06 | 1.16401e+06 | 1.16678e+06 | 1.17921e+06 | 1.18336e+06 | 1.18336e+06 | 1.18336e+06 | 1.17108e+06 | 10793.5 | 1.165e+08 | 0.335126 | -2.33333 |
| Y Coordinate | FieldType.INTEGER | 1.8315e+06 | 1.908e+06 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 1.8315e+06 | 1.8315e+06 | 1.8315e+06 | 1.83614e+06 | 1.85005e+06 | 1.89352e+06 | 1.908e+06 | 1.908e+06 | 1.908e+06 | 1.86319e+06 | 39905.2 | 1.59243e+09 | 0.293465 | -2.33333 |
| Year | FieldType.INTEGER | 2016 | 2016 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 0 | 0 | NaN | NaN |
| Updated On | FieldType.DATE | 2016-05-11 15:48:00+00:00 | 2016-05-27 15:45:00+00:00 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Latitude | FieldType.DECIMAL | 41.6928 | 41.9032 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 41.6928 | 41.6928 | 41.6928 | 41.7057 | 41.7441 | 41.8634 | 41.9032 | 41.9032 | 41.9032 | 41.78 | 0.109695 | 0.012033 | 0.292478 | -2.33333 |
| Longitude | FieldType.DECIMAL | -87.6764 | -87.6043 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | -87.6764 | -87.6764 | -87.6764 | -87.6734 | -87.6645 | -87.6194 | -87.6043 | -87.6043 | -87.6043 | -87.6484 | 0.0386264 | 0.001492 | 0.344429 | -2.33333 |
| Location | FieldType.STRING | (41.903206037, -87.676361925) | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 7.0 |
Training with Dataset
Now that you have registered your Dataset, you can retrieve it by name and convert it to a pandas DataFrame or a Spark DataFrame in your train.py script.
# Sample train.py script
import azureml.core
import pandas as pd
import datetime
import shutil
from azureml.core import Workspace, Datastore, Dataset, Experiment, Run
from sklearn.model_selection import train_test_split
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from sklearn.tree import DecisionTreeClassifier
run = Run.get_context()
workspace = run.experiment.workspace
# Access Dataset registered with the workspace by name
dataset_name = 'training_data'
dataset = Dataset.get(workspace=workspace, name=dataset_name)
ds_def = dataset.get_definition()
dataset_val, dataset_train = ds_def.random_split(percentage=0.3)  # 30% for validation, 70% for training
y_df = dataset_train.keep_columns(['HasDetections']).to_pandas_dataframe()
x_df = dataset_train.drop_columns(['HasDetections']).to_pandas_dataframe()
y_val = dataset_val.keep_columns(['HasDetections']).to_pandas_dataframe()
x_val = dataset_val.drop_columns(['HasDetections']).to_pandas_dataframe()
data = {"train": {"X": x_df, "y": y_df},
"validation": {"X": x_val, "y": y_val}}
clf = DecisionTreeClassifier().fit(data["train"]["X"], data["train"]["y"])
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_df, y_df)))
print('Accuracy of Decision Tree classifier on validation set: {:.2f}'.format(clf.score(x_val, y_val)))
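If your training runs on Spark rather than pandas, the same registered Dataset can be materialized as a Spark DataFrame instead. A minimal sketch, assuming the preview Dataset API's to_spark_dataframe method and an active Spark environment on the compute:
# alternative: load the Dataset as a Spark DataFrame (assumes a Spark environment)
spark_df = dataset.to_spark_dataframe()
spark_df.printSchema()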
For an end-to-end walkthrough, see the Dataset tutorial. You will learn how to:
- Explore and prepare data for training the model.
- Register the Dataset in your workspace for easy access in training.
- Take snapshots of data to ensure models can be trained with the same data every time (sketched after this list).
- Use the registered Dataset in your training script.
- Create and use multiple Dataset definitions so that updates to the definition don't break existing pipelines and scripts (sketched after this list).
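For illustration, here is a minimal sketch of the snapshot and definition steps mentioned in the list above. The create_snapshot and update_definition calls are assumptions about this preview API; check the SDK reference for the exact signatures.
# take a snapshot so models can be trained on the same data every time
# (create_data_snapshot=True also saves a copy of the data; assumed preview API)
snapshot = dataset.create_snapshot(snapshot_name='crime_snapshot', create_data_snapshot=True)
# create a new definition without breaking scripts pinned to the current one
# (assumed preview API; existing consumers keep using the definition version they reference)
new_definition = dataset.get_definition().drop_columns(['Location'])
dataset = dataset.update_definition(new_definition, 'drop the Location column')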