MLN repo autocleanup

This commit is contained in:
Vitaliy Zhurba
2019-07-12 10:27:43 -04:00
parent 4170a394ed
commit e792ba8278
463 changed files with 159041 additions and 189709 deletions

# Azure Machine Learning Datasets (preview)
Azure Machine Learning Datasets (preview) make it easier to access and work with your data. Datasets manage data in various scenarios such as model training and pipeline creation. Using the Azure Machine Learning SDK, you can access underlying storage, explore and prepare data, manage the life cycle of different Dataset definitions, and compare between Datasets used in training and in production.
## Create and Register Datasets
It's easy to create Datasets from either local files or Azure Datastores.
```Python
from azureml.core.workspace import Workspace
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset
datastore_name = 'your datastore name'
# get existing workspace
workspace = Workspace.from_config()
# get Datastore from the workspace
dstore = Datastore.get(workspace, datastore_name)
# create an in-memory Dataset on your local machine
dataset = Dataset.from_delimited_files(dstore.path('data/src/crime.csv'))
```
To consume Datasets across various scenarios in Azure Machine Learning service such as automated machine learning, model training and pipeline creation, you need to register the Datasets with your workspace. By doing so, you can also share and reuse the Datasets within your organization.
```Python
dataset = dataset.register(workspace=workspace,
                           name='dataset_crime',
                           description='Training data')
```
## Sampling
Sampling can be particularly useful with Datasets that are too large to analyze efficiently in full. It enables data scientists to work with a manageable amount of data to build and train machine learning models. At this time, the [`sample()`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py#sample-sample-strategy--arguments-) method from the Dataset class supports Top N, Simple Random, and Stratified sampling strategies.
After sampling, you can convert your sampled Dataset to a pandas DataFrame for training. Because the native [`sample()`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py#sample-sample-strategy--arguments-) method from the Dataset class loads the sampled data on the fly, you avoid loading the full data into memory.
### Top N sample
For Top N sampling, the first n records of your Dataset are your sample. This is helpful if you are just trying to get an idea of what your data records look like or to see what fields are in your data.
```Python
top_n_sample_dataset = dataset.sample('top_n', {'n': 5})
top_n_sample_dataset.to_pandas_dataframe()
```
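If you want a feel for what Top N sampling does without the SDK, it matches pandas `head()`. A minimal sketch with made-up toy data (the `id` and `value` columns are hypothetical, not part of the crime Dataset):

```python
import pandas as pd

# Toy stand-in for a Dataset; columns are made up for illustration
df = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# Top N sampling keeps the first n records, analogous to
# dataset.sample('top_n', {'n': 5}) followed by to_pandas_dataframe()
top_n = df.head(5)
print(top_n["id"].tolist())  # -> [0, 1, 2, 3, 4]
```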
### Simple random sample
In Simple Random sampling, every member of the data population has an equal chance of being selected as part of the sample. With the `simple_random` sample strategy, records are selected from your Dataset based on the specified probability, and a modified Dataset is returned. The `seed` parameter is optional.
```Python
seed = 42  # optional; set a seed for reproducible sampling
simple_random_sample_dataset = dataset.sample('simple_random', {'probability': 0.3, 'seed': seed})
simple_random_sample_dataset.to_pandas_dataframe()
```
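As a rough pandas analogue, `DataFrame.sample(frac=...)` draws an exact fraction of rows, whereas the `simple_random` strategy keeps each row independently with the given probability, so its sample size varies slightly. A sketch with toy data (the `id` column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000)})

# frac-based sampling draws an exact 30% of the rows; the Dataset
# 'simple_random' strategy instead keeps each row with probability 0.3,
# so its sample size would vary around 300
sample = df.sample(frac=0.3, random_state=42)
print(len(sample))  # -> 300
```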
### Stratified sample
Stratified samples ensure that certain groups of a population are represented in the sample. With the `stratified` sample strategy, the population is divided into strata, or subgroups, based on similarities, and records are randomly selected from each stratum according to the weights given by the `fractions` parameter.
In the following example, each record is grouped by the specified columns and included according to the stratum-to-weight mapping in `fractions`. If a stratum is not specified, or a record cannot be grouped, the default sampling weight is 0.
```Python
# take 50% of records with `Primary Type` as `THEFT` and 20% of records with `Primary Type` as `DECEPTIVE PRACTICE` into sample Dataset
fractions = {}
fractions[('THEFT',)] = 0.5
fractions[('DECEPTIVE PRACTICE',)] = 0.2
seed = 42  # optional; set a seed for reproducible sampling
sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions, 'seed': seed})
sample_dataset.to_pandas_dataframe()
```
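To see what stratified sampling does conceptually, here is a pandas equivalent on a toy frame (the data and the `BATTERY` stratum are made up; only the `Primary Type` column mirrors the example above). Each stratum is sampled at its own rate, and strata without a listed weight default to 0:

```python
import pandas as pd

df = pd.DataFrame({
    "Primary Type": ["THEFT"] * 10 + ["DECEPTIVE PRACTICE"] * 10 + ["BATTERY"] * 10,
    "id": range(30),
})

# Stratum-to-weight mapping; strata not listed get a weight of 0
fractions = {"THEFT": 0.5, "DECEPTIVE PRACTICE": 0.2}

# Sample each stratum at its own rate, then recombine
parts = [
    group.sample(frac=fractions.get(name, 0.0), random_state=42)
    for name, group in df.groupby("Primary Type")
]
stratified = pd.concat(parts)
print(stratified["Primary Type"].value_counts().to_dict())
# -> {'THEFT': 5, 'DECEPTIVE PRACTICE': 2}
```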
## Explore with summary statistics
Detect anomalies, missing values, or error counts with the [`get_profile()`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#get-profile-arguments-none--generate-if-not-exist-true--workspace-none--compute-target-none-) method. This function gets the profile and summary statistics of your data, which in turn helps determine the necessary data preparation operations to apply.
```Python
# get pre-calculated profile
# if there is no precalculated profile available or the precalculated profile is not up-to-date, this method will generate a new profile of the Dataset
dataset.get_profile()
```
||Type|Min|Max|Count|Missing Count|Not Missing Count|Percent missing|Error Count|Empty count|0.1% Quantile|1% Quantile|5% Quantile|25% Quantile|50% Quantile|75% Quantile|95% Quantile|99% Quantile|99.9% Quantile|Mean|Standard Deviation|Variance|Skewness|Kurtosis
-|----|---|---|-----|-------------|-----------------|---------------|-----------|-----------|-------------|-----------|-----------|------------|------------|------------|------------|------------|--------------|----|------------------|--------|--------|--------
ID|FieldType.INTEGER|1.04986e+07|1.05351e+07|10.0|0.0|10.0|0.0|0.0|0.0|1.04986e+07|1.04992e+07|1.04986e+07|1.05166e+07|1.05209e+07|1.05259e+07|1.05351e+07|1.05351e+07|1.05351e+07|1.05195e+07|12302.7|1.51358e+08|-0.495701|-1.02814
Case Number|FieldType.STRING|HZ239907|HZ278872|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Date|FieldType.DATE|2016-04-04 23:56:00+00:00|2016-04-15 17:00:00+00:00|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Block|FieldType.STRING|004XX S KILBOURN AVE|113XX S PRAIRIE AVE|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
IUCR|FieldType.INTEGER|810|1154|10.0|0.0|10.0|0.0|0.0|0.0|810|850|810|890|1136|1153|1154|1154|1154|1058.5|137.285|18847.2|-0.785501|-1.3543
Primary Type|FieldType.STRING|DECEPTIVE PRACTICE|THEFT|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Description|FieldType.STRING|BOGUS CHECK|OVER $500|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Location Description|FieldType.STRING||SCHOOL, PUBLIC, BUILDING|10.0|0.0|10.0|0.0|0.0|1.0||||||||||||||
Arrest|FieldType.BOOLEAN|False|False|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Domestic|FieldType.BOOLEAN|False|False|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Beat|FieldType.INTEGER|531|2433|10.0|0.0|10.0|0.0|0.0|0.0|531|531|531|614|1318.5|1911|2433|2433|2433|1371.1|692.094|478994|0.105418|-1.60684
District|FieldType.INTEGER|5|24|10.0|0.0|10.0|0.0|0.0|0.0|5|5|5|6|13|19|24|24|24|13.5|6.94822|48.2778|0.0930109|-1.62325
Ward|FieldType.INTEGER|1|48|10.0|0.0|10.0|0.0|0.0|0.0|1|5|1|9|22.5|40|48|48|48|24.5|16.2635|264.5|0.173723|-1.51271
Community Area|FieldType.INTEGER|4|77|10.0|0.0|10.0|0.0|0.0|0.0|4|8.5|4|24|37.5|71|77|77|77|41.2|26.6366|709.511|0.112157|-1.73379
FBI Code|FieldType.INTEGER|6|11|10.0|0.0|10.0|0.0|0.0|0.0|6|6|6|6|11|11|11|11|11|9.4|2.36643|5.6|-0.702685|-1.59582
X Coordinate|FieldType.INTEGER|1.16309e+06|1.18336e+06|10.0|7.0|3.0|0.7|0.0|0.0|1.16309e+06|1.16309e+06|1.16309e+06|1.16401e+06|1.16678e+06|1.17921e+06|1.18336e+06|1.18336e+06|1.18336e+06|1.17108e+06|10793.5|1.165e+08|0.335126|-2.33333
Y Coordinate|FieldType.INTEGER|1.8315e+06|1.908e+06|10.0|7.0|3.0|0.7|0.0|0.0|1.8315e+06|1.8315e+06|1.8315e+06|1.83614e+06|1.85005e+06|1.89352e+06|1.908e+06|1.908e+06|1.908e+06|1.86319e+06|39905.2|1.59243e+09|0.293465|-2.33333
Year|FieldType.INTEGER|2016|2016|10.0|0.0|10.0|0.0|0.0|0.0|2016|2016|2016|2016|2016|2016|2016|2016|2016|2016|0|0|NaN|NaN
Updated On|FieldType.DATE|2016-05-11 15:48:00+00:00|2016-05-27 15:45:00+00:00|10.0|0.0|10.0|0.0|0.0|0.0||||||||||||||
Latitude|FieldType.DECIMAL|41.6928|41.9032|10.0|7.0|3.0|0.7|0.0|0.0|41.6928|41.6928|41.6928|41.7057|41.7441|41.8634|41.9032|41.9032|41.9032|41.78|0.109695|0.012033|0.292478|-2.33333
Longitude|FieldType.DECIMAL|-87.6764|-87.6043|10.0|7.0|3.0|0.7|0.0|0.0|-87.6764|-87.6764|-87.6764|-87.6734|-87.6645|-87.6194|-87.6043|-87.6043|-87.6043|-87.6484|0.0386264|0.001492|0.344429|-2.33333
Location|FieldType.STRING||(41.903206037, -87.676361925)|10.0|0.0|10.0|0.0|0.0|7.0||||||||||||||
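A few of the profile columns above (Count, Missing Count, Percent missing) can be approximated by hand with pandas, which helps when checking what `get_profile()` reports. A minimal sketch with made-up values; only the column names mirror the table:

```python
import pandas as pd

# Toy frame with missing values, loosely mirroring the Ward column above
df = pd.DataFrame({
    "Ward": [1.0, 5.0, 9.0, 22.0, 23.0, 40.0, 48.0, None, None, None],
    "Primary Type": ["THEFT"] * 6 + ["DECEPTIVE PRACTICE"] * 4,
})

# Hand-computed counterparts of a few get_profile() columns
summary = pd.DataFrame({
    "Missing Count": df.isna().sum(),
    "Percent missing": df.isna().mean(),
})
summary["Count"] = len(df)
print(summary)
```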
## Training with Dataset
Now that you have registered your Dataset, you can retrieve it by name and convert it to a pandas or Spark DataFrame in your train.py script.
```Python
# Sample train.py script
import azureml.core
import pandas as pd
from azureml.core import Workspace, Datastore, Dataset, Experiment, Run
from sklearn.tree import DecisionTreeClassifier

run = Run.get_context()
workspace = run.experiment.workspace

# Access the Dataset registered with the workspace by name
dataset_name = 'training_data'
dataset = Dataset.get(workspace=workspace, name=dataset_name)

# Split the current definition: ~30% validation, remainder training
ds_def = dataset.get_definition()
dataset_val, dataset_train = ds_def.random_split(percentage=0.3)

# Separate the label column from the features
y_df = dataset_train.keep_columns(['HasDetections']).to_pandas_dataframe()
x_df = dataset_train.drop_columns(['HasDetections']).to_pandas_dataframe()
y_val = dataset_val.keep_columns(['HasDetections']).to_pandas_dataframe()
x_val = dataset_val.drop_columns(['HasDetections']).to_pandas_dataframe()

data = {"train": {"X": x_df, "y": y_df},
        "validation": {"X": x_val, "y": y_val}}

clf = DecisionTreeClassifier().fit(data["train"]["X"], data["train"]["y"])
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_df, y_df)))
print('Accuracy of Decision Tree classifier on validation set: {:.2f}'.format(clf.score(x_val, y_val)))
```
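The split-and-separate steps above can be sketched without the SDK using plain pandas; a toy stand-in for the registered data (the `feature` column and all values are made up, and `sample`/`drop` here approximate `random_split`, `keep_columns`, and `drop_columns`):

```python
import pandas as pd

# Synthetic stand-in for the registered training data
df = pd.DataFrame({"HasDetections": [0, 1] * 50, "feature": range(100)})

# random_split(percentage=0.3) equivalent: ~30% validation, remainder training
val_df = df.sample(frac=0.3, random_state=7)
train_df = df.drop(val_df.index)

# keep_columns / drop_columns equivalents
y_train = train_df[["HasDetections"]]
x_train = train_df.drop(columns=["HasDetections"])

print(len(val_df), len(train_df))  # -> 30 70
```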
For an end-to-end tutorial, see the [Dataset tutorial](datasets-tutorial.ipynb). It walks through how to:
- Explore and prepare data for training the model.
- Register the Dataset in your workspace for easy access in training.
- Take snapshots of data to ensure models can be trained with the same data every time.
- Use the registered Dataset in your training script.
- Create and use multiple Dataset definitions to ensure that updates to the definition don't break existing pipelines/scripts.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Learn how to use Datasets in Azure ML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you learn how to use Azure ML Datasets to train a regression model with the Azure Machine Learning SDK for Python. You will\n",
"\n",
"* Explore and prepare data for training the model\n",
"* Register the Dataset in your workspace to share it with others\n",
"* Create and use multiple Dataset definitions to ensure that updates to the definition don't break existing pipelines/scripts\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you:\n",
"\n",
"☑ Set up a Python environment and import packages\n",
"\n",
"☑ Load the Titanic data from your Azure Blob Storage. (The [original data](https://www.kaggle.com/c/titanic/data) can be found on Kaggle)\n",
"\n",
"☑ Explore and cleanse the data to remove anomalies\n",
"\n",
"☑ Register the Dataset in your workspace, allowing you to use it in model training \n",
"\n",
"☑ Make changes to the dataset's definition without breaking the production model or the daily data pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-requisites:\n",
"Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook you will need:\n",
"\n",
"A Python 3.6 notebook server with the following installed:\n",
"* The Azure Machine Learning SDK for Python\n",
"* The Azure Machine Learning Data Prep SDK for Python\n",
"* The tutorial notebook\n",
"\n",
"Data and train.py script to store in your Azure Blob Storage Account.\n",
" * [Titanic data](./train-dataset/Titanic.csv)\n",
" * [train.py](./train-dataset/train.py)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create and register Datasets you need:\n",
"\n",
" * An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.\n",
"\n",
" * An Azure Machine Learning service workspace. See the [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).\n",
"\n",
" * The Azure Machine Learning SDK for Python (version 1.0.21 or later). To install or update to the latest version of the SDK, see [Install or update the SDK](https://docs.microsoft.com/python/api/overview/azure/ml/install?view=azure-ml-py).\n",
"\n",
"\n",
"For more information on how to set up your workspace, see the [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).\n",
"\n",
"First, set up your Python environment: import your packages, including `azureml.dataprep` and `azureml.core.dataset`. Then access your workspace through your Azure subscription and set up your compute target."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"import azureml.core\n",
"import pandas as pd\n",
"import logging\n",
"import os\n",
"import shutil\n",
"from azureml.core import Workspace, Datastore, Dataset\n",
"\n",
"# Get existing workspace from config.json file in the same folder as the tutorial notebook\n",
"# You can download the config file from your workspace\n",
"workspace = Workspace.from_config()\n",
"print(\"Workspace\")\n",
"print(workspace)\n",
"print(\"Compute targets\")\n",
"print(workspace.compute_targets)\n",
"\n",
"# Get compute target that has already been attached to the workspace\n",
"# Pick the right compute target from the list of computes attached to your workspace\n",
"\n",
"compute_target_name = 'datasetBugBash'\n",
"remote_compute_target = workspace.compute_targets[compute_target_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To load data to your dataset, you will access the data through your datastore. After you create your dataset, you can use `get_profile()` to see your data's statistics.\n",
"\n",
"We will now upload the [original data](https://www.kaggle.com/c/titanic/data) to the default datastore (blob) within your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = workspace.get_default_datastore()\n",
"datastore.upload_files(files=['./train-dataset/Titanic.csv'],\n",
" target_path='train-dataset/',\n",
" overwrite=True,\n",
" show_progress=True)\n",
"\n",
"dataset = Dataset.auto_read_files(path=datastore.path('train-dataset/Titanic.csv'))\n",
"\n",
"#Display Dataset Profile of the Titanic Dataset\n",
"dataset.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To predict whether a person survived the Titanic's sinking, the columns relevant to training the model are 'Survived', 'Pclass', 'Sex', 'SibSp', and 'Parch'. You can update your dataset's definition to keep only the columns you need. You will also need to convert the values (\"male\", \"female\") in the \"Sex\" column to 0 or 1, because the algorithm in the train.py file uses numeric values instead of strings.\n",
"\n",
"For more examples of preparing data with Datasets, see [Explore and prepare data with the Dataset class](https://aka.ms/azureml/howto/exploreandpreparedata)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds_def = dataset.get_definition()\n",
"ds_def = ds_def.keep_columns(['Survived','Pclass', 'Sex','SibSp', 'Parch', 'Fare'])\n",
"ds_def = ds_def.replace('Sex','male', 0)\n",
"ds_def = ds_def.replace('Sex','female', 1)\n",
"ds_def.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have cleaned your data, you can register your dataset in your workspace. \n",
"\n",
"Registering your dataset allows you to easily have access to your processed data and share it with other people in your organization using the same workspace. It can be accessed in any notebook or script that is connected to your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.update_definition(ds_def, 'Cleaned Data')\n",
"\n",
"dataset.generate_profile(compute_target='local').get_result()\n",
"\n",
"dataset_name = 'clean_Titanic_tutorial'\n",
"dataset = dataset.register(workspace=workspace,\n",
" name=dataset_name,\n",
" description='training dataset',\n",
" tags = {'year':'2019', 'month':'Apr'},\n",
" exist_ok=True)\n",
"workspace.datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The following code snippet will train your model locally using the train.py script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment, RunConfiguration\n",
"\n",
"experiment_name = 'training-datasets'\n",
"experiment = Experiment(workspace = workspace, name = experiment_name)\n",
"project_folder = './train-dataset/'\n",
"\n",
"# create a new RunConfig object\n",
"run_config = RunConfiguration()\n",
"\n",
"run_config.environment.python.user_managed_dependencies = True\n",
"\n",
"from azureml.core import Run\n",
"from azureml.core import ScriptRunConfig\n",
"\n",
"src = ScriptRunConfig(source_directory=project_folder, \n",
" script='train.py', \n",
" run_config=run_config) \n",
"run = experiment.submit(config=src)\n",
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also use the same script with your dataset for your Pipeline's Python Script Step.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import Pipeline, PipelineData\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"from azureml.data.data_reference import DataReference\n",
"\n",
"trainStep = PythonScriptStep(script_name=\"train.py\",\n",
" compute_target=remote_compute_target,\n",
" source_directory=project_folder)\n",
"\n",
"pipeline = Pipeline(workspace=workspace,\n",
" steps=trainStep)\n",
"\n",
"pipeline_run = experiment.submit(pipeline)\n",
"pipeline_run.wait_for_completion()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can make changes to the dataset's definition without breaking the production model or the daily data pipeline. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can call `get_definitions()` to list the dataset's versions. Each update to the dataset adds a new definition version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.get_definitions()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset = Dataset.get(workspace=workspace, name=dataset_name)\n",
"ds_def = dataset.get_definition()\n",
"ds_def = ds_def.drop_columns(['Fare'])\n",
"dataset = dataset.update_definition(ds_def, 'Dropping Fare as PClass and Fare are strongly correlated')\n",
"dataset.generate_profile(compute_target='local').get_result()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset definitions can be deprecated when their use is no longer recommended and a replacement is available. When a deprecated dataset definition is used in an AML Experimentation/Pipeline scenario, a warning message is returned, but execution is not blocked. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deprecate dataset definition 1 by the 2nd definition\n",
"ds_def = dataset.get_definition('1')\n",
"ds_def.deprecate(deprecate_by_dataset_id=dataset._id, deprecated_by_definition_version='2')\n",
"dataset.get_definitions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset definitions can be archived when a definition should no longer be used for any reason (for example, the underlying data is no longer available). When an archived dataset definition is used in an AML Experimentation/Pipeline scenario, execution is blocked with an error. No further actions can be performed on archived Dataset definitions, but the references are kept intact. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Archive the deprecated dataset definition #1\n",
"ds_def = dataset.get_definition('1')\n",
"ds_def.archive()\n",
"dataset.get_definitions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also reactivate any definition that you archived for later use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds_def = dataset.get_definition('1')\n",
"ds_def.reactivate()\n",
"dataset.get_definitions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You have now finished using a dataset from start to finish of your experiment!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/datasets/datasets-tutorial/datasets-tutorial.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "cforbe"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
{
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"name": "python36",
"language": "python"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"authors": [
{
"name": "cforbe"
}
],
"language_info": {
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"name": "python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.6.4"
}
},
"nbformat": 4,
"cells": [
{
"metadata": {},
"source": [
"# Tutorial: Learn how to use Datasets in Azure ML"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"In this tutorial, you learn how to use Azure ML Datasets to train a classification model with the Azure Machine Learning SDK for Python. You will:\n",
"\n",
"* Explore and prepare data for training the model\n",
"* Register the Dataset in your workspace to share it with others\n",
"* Create and use multiple Dataset definitions to ensure that updates to the definition don't break existing pipelines/scripts\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"In this tutorial, you:\n",
"\n",
"☑ Set up a Python environment and import packages\n",
"\n",
"☑ Load the Titanic data from your Azure Blob Storage. (The [original data](https://www.kaggle.com/c/titanic/data) can be found on Kaggle)\n",
"\n",
"☑ Explore and cleanse the data to remove anomalies\n",
"\n",
"☑ Register the Dataset in your workspace, allowing you to use it in model training \n",
"\n",
"☑ Make changes to the dataset's definition without breaking the production model or the daily data pipeline"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"## Prerequisites\n",
"Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook you will need:\n",
"\n",
"A Python 3.6 notebook server with the following installed:\n",
"* The Azure Machine Learning SDK for Python\n",
"* The Azure Machine Learning Data Prep SDK for Python\n",
"* The tutorial notebook\n",
"\n",
"The data and train.py script to store in your Azure Blob Storage account:\n",
" * [Titanic data](./train-dataset/Titanic.csv)\n",
" * [train.py](./train-dataset/train.py)"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"To create and register Datasets you need:\n",
"\n",
" * An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.\n",
"\n",
" * An Azure Machine Learning service workspace. See [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).\n",
"\n",
" * The Azure Machine Learning SDK for Python (version 1.0.21 or later). To install or update to the latest version of the SDK, see [Install or update the SDK](https://docs.microsoft.com/python/api/overview/azure/ml/install?view=azure-ml-py).\n",
"\n",
"\n",
"For more information on how to set up your workspace, see the [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).\n",
"\n",
"First, set up your Python environment: import the required packages, including `azureml.dataprep` and `azureml.core`. Then access your workspace through your Azure subscription and select a compute target. "
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import azureml.dataprep as dprep\n",
"import azureml.core\n",
"import pandas as pd\n",
"import logging\n",
"import os\n",
"import shutil\n",
"from azureml.core import Workspace, Datastore, Dataset\n",
"\n",
"# Get existing workspace from config.json file in the same folder as the tutorial notebook\n",
"# You can download the config file from your workspace\n",
"workspace = Workspace.from_config()\n",
"print(\"Workspace\")\n",
"print(workspace)\n",
"print(\"Compute targets\")\n",
"print(workspace.compute_targets)\n",
"\n",
"# Get compute target that has already been attached to the workspace\n",
"# Pick the right compute target from the list of computes attached to your workspace\n",
"\n",
"compute_target_name = 'datasetBugBash'\n",
"remote_compute_target = workspace.compute_targets[compute_target_name]"
],
"cell_type": "code"
},
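{
"metadata": {},
"source": [
"Note: the compute target name above ('datasetBugBash') is specific to this workspace. A defensive lookup with a clear error message can be sketched as below; the `pick_compute` helper and its behavior are illustrative, not part of the SDK."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Illustrative helper: pick a compute target by name, failing with a clear message.\n",
"# `targets` stands in for the dict returned by workspace.compute_targets.\n",
"def pick_compute(targets, preferred):\n",
"    if preferred in targets:\n",
"        return targets[preferred]\n",
"    available = ', '.join(sorted(targets)) or '<none>'\n",
"    raise KeyError('Compute target %r not found; available: %s' % (preferred, available))\n",
"\n",
"# Example with a plain dict standing in for workspace.compute_targets\n",
"print(pick_compute({'datasetBugBash': 'cpu-cluster'}, 'datasetBugBash'))"
],
"cell_type": "code"
},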
{
"metadata": {},
"source": [
"To load data into your Dataset, you access the data through your datastore. After you create your Dataset, you can use `get_profile()` to see statistics about your data.\n",
"\n",
"We will now upload the [original data](https://www.kaggle.com/c/titanic/data) to the default datastore (blob) within your workspace."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"datastore = workspace.get_default_datastore()\n",
"datastore.upload_files(files=['./train-dataset/Titanic.csv'],\n",
" target_path='train-dataset/',\n",
" overwrite=True,\n",
" show_progress=True)\n",
"\n",
"dataset = Dataset.auto_read_files(path=datastore.path('train-dataset/Titanic.csv'))\n",
"\n",
"#Display Dataset Profile of the Titanic Dataset\n",
"dataset.get_profile()"
],
"cell_type": "code"
},
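{
"metadata": {},
"source": [
"As a rough local analogue of `get_profile()` (illustrative only, using a tiny synthetic frame instead of the real Titanic data), pandas `describe()` reports similar summary statistics:"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import pandas as pd\n",
"\n",
"# Tiny synthetic stand-in for the Titanic data (not the real file)\n",
"sample = pd.DataFrame({'Survived': [0, 1, 1, 0], 'Fare': [7.25, 71.28, 8.05, 53.1]})\n",
"print(sample.describe())  # count, mean, std, min, quartiles, max per numeric column"
],
"cell_type": "code"
},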
{
"metadata": {},
"source": [
"To predict whether a person survived the Titanic's sinking, the columns relevant to training the model are 'Survived', 'Pclass', 'Sex', 'SibSp', and 'Parch'. You can update your dataset's definition to keep only the columns you need. You also need to convert the values (\"male\", \"female\") in the \"Sex\" column to 0 or 1, because the algorithm in the train.py file expects numeric values rather than strings.\n",
"\n",
"For more examples of preparing data with Datasets, see [Explore and prepare data with the Dataset class](https://aka.ms/azureml/howto/exploreandpreparedata)."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"ds_def = dataset.get_definition()\n",
"ds_def = ds_def.keep_columns(['Survived','Pclass', 'Sex','SibSp', 'Parch', 'Fare'])\n",
"ds_def = ds_def.replace('Sex','male', 0)\n",
"ds_def = ds_def.replace('Sex','female', 1)\n",
"ds_def.head(5)"
],
"cell_type": "code"
},
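{
"metadata": {},
"source": [
"For comparison, the keep/replace steps above map to plain pandas as in this sketch (the column names follow the Titanic schema; the frame here is synthetic):"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"import pandas as pd\n",
"\n",
"# Synthetic rows with the Titanic schema used above\n",
"toy = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1], 'Sex': ['male', 'female'],\n",
"                    'SibSp': [1, 1], 'Parch': [0, 0], 'Fare': [7.25, 71.28],\n",
"                    'Name': ['A', 'B']})\n",
"toy = toy[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']]  # keep_columns\n",
"toy['Sex'] = toy['Sex'].replace({'male': 0, 'female': 1})           # replace\n",
"print(toy.head())"
],
"cell_type": "code"
},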
{
"metadata": {},
"source": [
"Once you have cleaned your data, you can register your dataset in your workspace. \n",
"\n",
"Registering your dataset allows you to easily have access to your processed data and share it with other people in your organization using the same workspace. It can be accessed in any notebook or script that is connected to your workspace."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dataset = dataset.update_definition(ds_def, 'Cleaned Data')\n",
"\n",
"dataset.generate_profile(compute_target='local').get_result()\n",
"\n",
"dataset_name = 'clean_Titanic_tutorial'\n",
"dataset = dataset.register(workspace=workspace,\n",
" name=dataset_name,\n",
" description='training dataset',\n",
" tags = {'year':'2019', 'month':'Apr'},\n",
" exist_ok=True)\n",
"workspace.datasets"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"\n",
"The following code snippet will train your model locally using the train.py script."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.core import Experiment, RunConfiguration\n",
"\n",
"experiment_name = 'training-datasets'\n",
"experiment = Experiment(workspace = workspace, name = experiment_name)\n",
"project_folder = './train-dataset/'\n",
"\n",
"# create a new RunConfig object\n",
"run_config = RunConfiguration()\n",
"\n",
"run_config.environment.python.user_managed_dependencies = True\n",
"\n",
"from azureml.core import Run\n",
"from azureml.core import ScriptRunConfig\n",
"\n",
"src = ScriptRunConfig(source_directory=project_folder, \n",
" script='train.py', \n",
" run_config=run_config) \n",
"run = experiment.submit(config=src)\n",
"run.wait_for_completion(show_output=True)"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"You can also use the same script and dataset in your pipeline's `PythonScriptStep`.\n"
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from azureml.pipeline.core import Pipeline, PipelineData\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"from azureml.data.data_reference import DataReference\n",
"\n",
"trainStep = PythonScriptStep(script_name=\"train.py\",\n",
" compute_target=remote_compute_target,\n",
" source_directory=project_folder)\n",
"\n",
"pipeline = Pipeline(workspace=workspace,\n",
" steps=trainStep)\n",
"\n",
"pipeline_run = experiment.submit(pipeline)\n",
"pipeline_run.wait_for_completion()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"You can make changes to the dataset's definition without breaking the production model or the daily data pipeline. "
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"You can call `get_definitions()` to see all of a dataset's definition versions. Each update to the dataset adds a new version."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dataset.get_definitions()"
],
"cell_type": "code"
},
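{
"metadata": {},
"source": [
"The versioning behavior can be sketched with a minimal stand-in (illustrative only, not the SDK's implementation): each update appends a new definition version, and earlier versions stay addressable."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Minimal stand-in for dataset definition versioning (not the real SDK class)\n",
"class ToyDataset:\n",
"    def __init__(self, definition):\n",
"        self.definitions = {'1': definition}  # version -> definition\n",
"\n",
"    def update_definition(self, definition, note):\n",
"        version = str(len(self.definitions) + 1)\n",
"        self.definitions[version] = (definition, note)\n",
"        return version\n",
"\n",
"toy_ds = ToyDataset('keep all columns')\n",
"toy_ds.update_definition('drop Fare', 'Fare correlates with Pclass')\n",
"print(sorted(toy_ds.definitions))  # both versions remain addressable"
],
"cell_type": "code"
},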
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"dataset = Dataset.get(workspace=workspace, name=dataset_name)\n",
"ds_def = dataset.get_definition()\n",
"ds_def = ds_def.drop_columns(['Fare'])\n",
"dataset = dataset.update_definition(ds_def, 'Dropping Fare as PClass and Fare are strongly correlated')\n",
"dataset.generate_profile(compute_target='local').get_result()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Dataset definitions can be deprecated when their use is no longer recommended and a replacement is available. When a deprecated dataset definition is used in an AML experimentation or pipeline scenario, a warning message is returned, but execution is not blocked."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Deprecate dataset definition 1 by the 2nd definition\n",
"ds_def = dataset.get_definition('1')\n",
"ds_def.deprecate(deprecate_by_dataset_id=dataset._id, deprecated_by_definition_version='2')\n",
"dataset.get_definitions()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"Dataset definitions can be archived when a definition should not be used for any reason (for example, the underlying data is no longer available). When an archived dataset definition is used in an AML experimentation or pipeline scenario, execution is blocked with an error. No further actions can be performed on archived Dataset definitions, but the references are kept intact."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Archive the deprecated dataset definition #1\n",
"ds_def = dataset.get_definition('1')\n",
"ds_def.archive()\n",
"dataset.get_definitions()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"You can also reactivate any definition that you archived for later use."
],
"cell_type": "markdown"
},
{
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"ds_def = dataset.get_definition('1')\n",
"ds_def.reactivate()\n",
"dataset.get_definitions()"
],
"cell_type": "code"
},
{
"metadata": {},
"source": [
"You have now used a Dataset from the start of your experiment to the finish!"
],
"cell_type": "markdown"
},
{
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/datasets/datasets-tutorial/datasets-tutorial.png)"
],
"cell_type": "markdown"
}
],
"nbformat_minor": 2
}

View File

@@ -1,37 +1,37 @@
# train.py
import pandas as pd
from azureml.core import Run, Dataset
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Get the run context and the workspace the experiment belongs to
run = Run.get_context()
workspace = run.experiment.workspace

# Load the registered, cleaned Dataset and convert it to a pandas DataFrame
dataset_name = 'clean_Titanic_tutorial'
dataset = Dataset.get(workspace=workspace, name=dataset_name)
df = dataset.to_pandas_dataframe()

# Split features and label, holding out 20% of the rows for testing
x_col = ['Pclass', 'Sex', 'SibSp', 'Parch']
y_col = ['Survived']
x_df = df.loc[:, x_col]
y_df = df.loc[:, y_col]
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)

# Train a decision tree classifier and report accuracy on both splits
clf = DecisionTreeClassifier().fit(x_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(x_test, y_test)))