Azure Machine Learning Datasets (preview)
Azure Machine Learning Datasets (preview) make it easier to access and work with your data. Datasets manage data in scenarios such as model training and pipeline creation. With the Azure Machine Learning SDK, you can access underlying storage, explore and prepare data, manage the life cycle of different Dataset definitions, and compare the Datasets used in training with those used in production.
Create and Register Datasets
It's easy to create Datasets from either local files or Azure Datastores.
from azureml.core.workspace import Workspace
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset
datastore_name = 'your datastore name'
# get existing workspace
workspace = Workspace.from_config()
# get Datastore from the workspace
dstore = Datastore.get(workspace, datastore_name)
# create an in-memory Dataset on your local machine
dataset = Dataset.from_delimited_files(dstore.path('data/src/crime.csv'))
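If your data is in local files rather than a Datastore, you can point Dataset.from_delimited_files at a local path instead. A minimal sketch, assuming a hypothetical local CSV at ./data/crime.csv:
# create an in-memory Dataset from a local delimited file (hypothetical path)
local_dataset = Dataset.from_delimited_files(path='./data/crime.csv')
# materialize a preview as a pandas DataFrame
local_dataset.to_pandas_dataframe()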
To consume Datasets in Azure Machine Learning service scenarios such as automated machine learning, model training, and pipeline creation, you need to register them with your workspace. Registering also lets you share and reuse Datasets within your organization.
dataset = dataset.register(workspace=workspace,
                           name='dataset_crime',
                           description='Training data')
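After registration, the Dataset can be retrieved by name from any script or notebook with access to the workspace, which is how it is shared and reused. A minimal sketch using the same Dataset.get call that appears in the training script later in this article:
# retrieve the registered Dataset by name
registered_dataset = Dataset.get(workspace=workspace, name='dataset_crime')
# list all Datasets registered with the workspace (assumes Dataset.list is available in this preview SDK)
Dataset.list(workspace)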
Sampling
Sampling can be particularly useful with Datasets that are too large to analyze efficiently in full. It lets data scientists work with a manageable amount of data to build and train machine learning models. At this time, the sample() method of the Dataset class supports Top N, Simple Random, and Stratified sampling strategies.
After sampling, you can convert your sampled Dataset to a pandas DataFrame for training. Because the native sample() method of the Dataset class loads the sampled data on the fly, the full data is never loaded into memory.
Top N sample
For Top N sampling, the first n records of your Dataset are your sample. This is helpful if you are just trying to get an idea of what your data records look like or to see what fields are in your data.
top_n_sample_dataset = dataset.sample('top_n', {'n': 5})
top_n_sample_dataset.to_pandas_dataframe()
Simple random sample
In Simple Random sampling, every member of the data population has an equal chance of being selected as part of the sample. In the simple_random sample strategy, records from your Dataset are selected based on the specified probability, and a modified Dataset is returned. The seed parameter is optional.
seed = 123  # optional; set a seed for reproducible sampling
simple_random_sample_dataset = dataset.sample('simple_random', {'probability': 0.3, 'seed': seed})
simple_random_sample_dataset.to_pandas_dataframe()
Stratified sample
Stratified samples ensure that certain groups of a population are represented in the sample. In the stratified sample strategy, the population is divided into strata, or subgroups, based on similarities, and records are randomly selected from each stratum according to the weights given by the fractions parameter.
In the following example, each record is grouped by the specified columns and included in the sample according to its stratum's weight in fractions. If a stratum is not specified, or a record cannot be grouped, its default sampling weight is 0.
# take 50% of records with `Primary Type` as `THEFT` and 20% of records with `Primary Type` as `DECEPTIVE PRACTICE` into sample Dataset
fractions = {}
fractions[('THEFT',)] = 0.5
fractions[('DECEPTIVE PRACTICE',)] = 0.2
sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions, 'seed': seed})
sample_dataset.to_pandas_dataframe()
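To sanity-check the strata proportions, you can inspect the sampled data with ordinary pandas calls once it has been converted to a DataFrame, for example:
# count how many sampled records fall into each stratum
sample_df = sample_dataset.to_pandas_dataframe()
print(sample_df['Primary Type'].value_counts())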
Explore with summary statistics
Detect anomalies, missing values, or error counts with the get_profile() method. This function gets the profile and summary statistics of your data, which in turn helps determine the necessary data preparation operations to apply.
# get the pre-calculated profile
# if no pre-calculated profile is available, or it is out of date, this call generates a new profile of the Dataset
dataset.get_profile()
| Column | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent Missing | Error Count | Empty Count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Mean | Standard Deviation | Variance | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | FieldType.INTEGER | 1.04986e+07 | 1.05351e+07 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 1.04986e+07 | 1.04992e+07 | 1.04986e+07 | 1.05166e+07 | 1.05209e+07 | 1.05259e+07 | 1.05351e+07 | 1.05351e+07 | 1.05351e+07 | 1.05195e+07 | 12302.7 | 1.51358e+08 | -0.495701 | -1.02814 |
| Case Number | FieldType.STRING | HZ239907 | HZ278872 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Date | FieldType.DATE | 2016-04-04 23:56:00+00:00 | 2016-04-15 17:00:00+00:00 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Block | FieldType.STRING | 004XX S KILBOURN AVE | 113XX S PRAIRIE AVE | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| IUCR | FieldType.INTEGER | 810 | 1154 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 810 | 850 | 810 | 890 | 1136 | 1153 | 1154 | 1154 | 1154 | 1058.5 | 137.285 | 18847.2 | -0.785501 | -1.3543 |
| Primary Type | FieldType.STRING | DECEPTIVE PRACTICE | THEFT | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Description | FieldType.STRING | BOGUS CHECK | OVER $500 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Location Description | FieldType.STRING | SCHOOL, PUBLIC, BUILDING | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 1.0 | |||||||||||||||
| Arrest | FieldType.BOOLEAN | False | False | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Domestic | FieldType.BOOLEAN | False | False | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Beat | FieldType.INTEGER | 531 | 2433 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 531 | 531 | 531 | 614 | 1318.5 | 1911 | 2433 | 2433 | 2433 | 1371.1 | 692.094 | 478994 | 0.105418 | -1.60684 |
| District | FieldType.INTEGER | 5 | 24 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 5 | 5 | 5 | 6 | 13 | 19 | 24 | 24 | 24 | 13.5 | 6.94822 | 48.2778 | 0.0930109 | -1.62325 |
| Ward | FieldType.INTEGER | 1 | 48 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 1 | 5 | 1 | 9 | 22.5 | 40 | 48 | 48 | 48 | 24.5 | 16.2635 | 264.5 | 0.173723 | -1.51271 |
| Community Area | FieldType.INTEGER | 4 | 77 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 4 | 8.5 | 4 | 24 | 37.5 | 71 | 77 | 77 | 77 | 41.2 | 26.6366 | 709.511 | 0.112157 | -1.73379 |
| FBI Code | FieldType.INTEGER | 6 | 11 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6 | 6 | 6 | 6 | 11 | 11 | 11 | 11 | 11 | 9.4 | 2.36643 | 5.6 | -0.702685 | -1.59582 |
| X Coordinate | FieldType.INTEGER | 1.16309e+06 | 1.18336e+06 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 1.16309e+06 | 1.16309e+06 | 1.16309e+06 | 1.16401e+06 | 1.16678e+06 | 1.17921e+06 | 1.18336e+06 | 1.18336e+06 | 1.18336e+06 | 1.17108e+06 | 10793.5 | 1.165e+08 | 0.335126 | -2.33333 |
| Y Coordinate | FieldType.INTEGER | 1.8315e+06 | 1.908e+06 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 1.8315e+06 | 1.8315e+06 | 1.8315e+06 | 1.83614e+06 | 1.85005e+06 | 1.89352e+06 | 1.908e+06 | 1.908e+06 | 1.908e+06 | 1.86319e+06 | 39905.2 | 1.59243e+09 | 0.293465 | -2.33333 |
| Year | FieldType.INTEGER | 2016 | 2016 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 2016 | 0 | 0 | NaN | NaN |
| Updated On | FieldType.DATE | 2016-05-11 15:48:00+00:00 | 2016-05-27 15:45:00+00:00 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | ||||||||||||||
| Latitude | FieldType.DECIMAL | 41.6928 | 41.9032 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | 41.6928 | 41.6928 | 41.6928 | 41.7057 | 41.7441 | 41.8634 | 41.9032 | 41.9032 | 41.9032 | 41.78 | 0.109695 | 0.012033 | 0.292478 | -2.33333 |
| Longitude | FieldType.DECIMAL | -87.6764 | -87.6043 | 10.0 | 7.0 | 3.0 | 0.7 | 0.0 | 0.0 | -87.6764 | -87.6764 | -87.6764 | -87.6734 | -87.6645 | -87.6194 | -87.6043 | -87.6043 | -87.6043 | -87.6484 | 0.0386264 | 0.001492 | 0.344429 | -2.33333 |
| Location | FieldType.STRING | (41.903206037, -87.676361925) | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 7.0 |
Training with Dataset
Now that you have registered your Dataset, you can retrieve it by name and convert it to a pandas DataFrame or a Spark DataFrame in your train.py script.
# Sample train.py script
import azureml.core
import pandas as pd
import datetime
import shutil
from azureml.core import Workspace, Datastore, Dataset, Experiment, Run
from sklearn.model_selection import train_test_split
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from sklearn.tree import DecisionTreeClassifier
run = Run.get_context()
workspace = run.experiment.workspace
# Access Dataset registered with the workspace by name
dataset_name = 'training_data'
dataset = Dataset.get(workspace=workspace, name=dataset_name)
ds_def = dataset.get_definition()
dataset_val, dataset_train = ds_def.random_split(percentage=0.3)  # 30% for validation, 70% for training
y_df = dataset_train.keep_columns(['HasDetections']).to_pandas_dataframe()
x_df = dataset_train.drop_columns(['HasDetections']).to_pandas_dataframe()
y_val = dataset_val.keep_columns(['HasDetections']).to_pandas_dataframe()
x_val = dataset_val.drop_columns(['HasDetections']).to_pandas_dataframe()
data = {"train": {"X": x_df, "y": y_df},
"validation": {"X": x_val, "y": y_val}}
clf = DecisionTreeClassifier().fit(data["train"]["X"], data["train"]["y"])
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_df, y_df)))
print('Accuracy of Decision Tree classifier on validation set: {:.2f}'.format(clf.score(x_val, y_val)))
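If your training runs on Spark rather than pandas, the same registered Dataset can be materialized as a Spark DataFrame instead. A minimal sketch, assuming the preview Dataset API's to_spark_dataframe method and an active Spark environment on the compute:
# alternative: load the Dataset as a Spark DataFrame (assumes a Spark environment)
spark_df = dataset.to_spark_dataframe()
spark_df.printSchema()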
For an end-to-end walkthrough, see the Dataset tutorial. You will learn how to:
- Explore and prepare data for training the model.
- Register the Dataset in your workspace for easy access in training.
- Take snapshots of data to ensure models can be trained with the same data every time (sketched after this list).
- Use the registered Dataset in your training script.
- Create and use multiple Dataset definitions so that updates to the definition don't break existing pipelines and scripts (sketched after this list).
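For illustration, here is a minimal sketch of the snapshot and definition steps mentioned in the list above. The create_snapshot and update_definition calls are assumptions about this preview API; check the SDK reference for the exact signatures.
# take a snapshot so models can be trained on the same data every time
# (create_data_snapshot=True also saves a copy of the data; assumed preview API)
snapshot = dataset.create_snapshot(snapshot_name='crime_snapshot', create_data_snapshot=True)
# create a new definition without breaking scripts pinned to the current one
# (assumed preview API; existing consumers keep using the definition version they reference)
new_definition = dataset.get_definition().drop_columns(['Location'])
dataset = dataset.update_definition(new_definition, 'drop the Location column')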