{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with Azure ML Data Prep SDK\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Wonder how you can make the most of the Azure ML Data Prep SDK? In this \"Getting Started\" guide, we'll demonstrate how to do your normal data wrangling with this SDK and showcase a few highlights that make this SDK shine. Using a sample of this [Kaggle crime dataset](https://www.kaggle.com/currie32/crimes-in-chicago/home) as an example, we'll cover how to:\n",
"\n",
"* [Read in data](#Read)\n",
"* [Profile your data](#Profile)\n",
"* [Append rows](#Append)\n",
"* [Apply common data science transforms](#Data-science-transforms)\n",
" * [Summarize](#Summarize)\n",
" * [Join](#Join)\n",
" * [Filter](#Filter)\n",
" * [Replace](#Replace)\n",
"* [Consume your cleaned dataset](#Consume)\n",
"* [Explore advanced features](#Explore)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"from os import path\n",
"from tempfile import mkdtemp\n",
"\n",
"import pandas as pd\n",
"import azureml.dataprep as dprep\n",
"\n",
"# Paths for datasets\n",
"file_crime_dirty = '../../data/crime-dirty.csv'\n",
"file_crime_spring = '../../data/crime-spring.csv'\n",
"file_crime_winter = '../../data/crime-winter.csv'\n",
"file_aldermen = '../../data/chicago-aldermen-2015.csv'\n",
"\n",
"# Seed\n",
"RAND_SEED = 7251"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Read\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read in data\n",
"\n",
"Azure ML Data Prep supports many different file reading formats (i.e. CSV, Excel, Parquet) and the ability to infer column types automatically. To see how powerful the `auto_read_file` capability is, let's take a peek at the `dirty-crime.csv`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dprep.read_csv(path=file_crime_dirty).head(7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common occurrence in many datasets is to have a column of values with commas; in our case, the last column represents location in the form of longitude-latitude pair. The default CSV reader interprets this comma as a delimiter and thus splits the data into two columns. Furthermore, it incorrectly reads in the header as the column name. Normally, we would need to `skip` the header and specify the delimiter as `|`, but our `auto_read_file` eliminates that work:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crime_dirty = dprep.auto_read_file(path=file_crime_dirty)\n",
"\n",
"crime_dirty.head(5)"
]
},
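{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a hedged sketch of the manual read that `auto_read_file` saves us from. We assume `read_csv` accepts `separator` and `skip_rows` parameters; check the data ingestion guide linked below for the exact signature:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of the manual equivalent: skip the extra header line and\n",
"# treat '|' as the delimiter. `separator` and `skip_rows` are assumptions;\n",
"# verify them against the read_csv reference before relying on this.\n",
"crime_manual = dprep.read_csv(\n",
"    path=file_crime_dirty,\n",
"    separator='|',\n",
"    skip_rows=1\n",
")\n",
"\n",
"crime_manual.head(5)"
]
},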
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Advanced features:__ if you'd like to specify the file type and adjust how you want to read files in, you can see the list of our specialized file readers and how to use them [here](../../how-to-guides/data-ingestion.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Profile\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profile your data\n",
"\n",
"Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics. Notice that our auto file reader automatically guessed the column type:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crime_dirty.get_profile()"
]
},
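{
"cell_type": "markdown",
"metadata": {},
"source": [
"The profile is also useful programmatically. Below is a sketch of pulling one column's statistics out of the returned profile; the `columns` mapping and the `min`/`max` attributes are assumptions about the profile object, so verify them against the SDK reference:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: the `columns` mapping and the min/max attributes on the\n",
"# profile object are assumptions; consult the SDK reference to confirm.\n",
"profile = crime_dirty.get_profile()\n",
"column_profile = profile.columns['ID']\n",
"print(column_profile.min, column_profile.max)"
]
},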
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Append\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Append rows\n",
"\n",
"What if your data is split across multiple files? We support the ability to append multiple datasets column-wise and row-wise. Here, we demonstrate how you can coalesce datasets row-wise:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Datasets with the same schema as crime_dirty\n",
"crime_winter = dprep.auto_read_file(path=file_crime_winter)\n",
"crime_spring = dprep.auto_read_file(path=file_crime_spring)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crime = (crime_dirty.append_rows(dataflows=[crime_winter, crime_spring]))\n",
"\n",
"crime.take_sample(probability=0.25, seed=RAND_SEED).head(5)"
]
},
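{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check after appending is to compare row counts. In this sketch we assume `Dataflow` exposes a `row_count` property; if your SDK version doesn't have it, the row count reported by `get_profile()` serves the same purpose:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: the appended dataflow should hold the rows of all three\n",
"# inputs. `row_count` is an assumption; get_profile() also reports counts.\n",
"for name, dataflow in [('crime_dirty', crime_dirty),\n",
"                       ('crime_winter', crime_winter),\n",
"                       ('crime_spring', crime_spring),\n",
"                       ('crime (appended)', crime)]:\n",
"    print(name, dataflow.row_count)"
]
},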
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Advanced features:__ you can learn how to append column-wise and how to deal with appending data with different schemas [here](../../how-to-guides/append-columns-and-rows.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Data-science-transforms\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply common data science transforms\n",
"\n",
"Azure ML Data Prep supports almost all common data science transforms found in other industry-standard data science libraries. Here, we'll explore the ability to `summarize`, `join`, `filter`, and `replace`. \n",
"\n",
"__Advanced features:__\n",
"* We also provide \"smart\" transforms not found in pandas that use machine learning to [derive new columns](../../how-to-guides/derive-column-by-example.ipynb), [split columns](../../how-to-guides/split-column-by-example.ipynb), and [fuzzy grouping](../../how-to-guides/fuzzy-group.ipynb).\n",
"* Finally, we also help featurize your dataset to prepare it for machine learning; learn more about our featurizers like [one-hot encoder](../../how-to-guides/one-hot-encoder.ipynb), [label encoder](../../how-to-guides/label-encoder.ipynb), [min-max scaler](../../how-to-guides/min-max-scaler.ipynb), and [random (train-test) split](../../how-to-guides/random-split.ipynb).\n",
"* Our complete list of example Notebooks for transforms can be found in our [How-to Guides](../../how-to-guides)."
]
},
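{
"cell_type": "markdown",
"metadata": {},
"source": [
"As promised above, here is a hedged sketch of the random (train-test) split. We assume `random_split(percentage, seed)` returns two Dataflows; the linked random-split guide is the authoritative reference:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: split the appended dataset into train and test dataflows.\n",
"# random_split(percentage, seed) is an assumption; see the linked guide.\n",
"crime_train, crime_test = crime.random_split(percentage=0.8, seed=RAND_SEED)\n",
"\n",
"crime_train.head(5)"
]
},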
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Summarize\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarize\n",
"\n",
"Let's see which wards had the most crimes in our sample dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crime_summary = (crime\n",
" .summarize(\n",
" summary_columns=[\n",
" dprep.SummaryColumnsValue(\n",
" column_id='ID', \n",
" summary_column_name='total_ward_crimes', \n",
" summary_function=dprep.SummaryFunction.COUNT\n",
" )\n",
" ],\n",
" group_by_columns=['Ward']\n",
" )\n",
")\n",
"\n",
"(crime_summary\n",
" .sort(sort_order=[('total_ward_crimes', True)])\n",
" .head(5)\n",
")"
]
},
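{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're coming from pandas, the `summarize` call above corresponds roughly to a `groupby` with a count aggregation. Here is a sketch of that equivalent, using the `to_pandas_dataframe` conversion covered later in this guide:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough pandas equivalent of the summarize + sort above.\n",
"crime_df = crime.to_pandas_dataframe()\n",
"(crime_df\n",
"    .groupby('Ward')['ID']\n",
"    .count()\n",
"    .rename('total_ward_crimes')\n",
"    .sort_values(ascending=False)\n",
"    .head(5)\n",
")"
]
},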
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Join\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Join\n",
"\n",
"Let's annotate each observation with more information about the ward where the crime occurred. Let's do so by joining `crime` with a dataset which lists the current aldermen for each ward:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aldermen = dprep.auto_read_file(path=file_aldermen)\n",
"\n",
"aldermen.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crime.join(\n",
" left_dataflow=crime,\n",
" right_dataflow=aldermen,\n",
" join_key_pairs=[\n",
" ('Ward', 'Ward')\n",
" ]\n",
").head(5)"
]
},
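{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the rough pandas equivalent of this join is a `merge` on the shared `Ward` column (in practice the dtypes of the key column may need aligning first):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough pandas equivalent of the dataflow join above.\n",
"crime_df = crime.to_pandas_dataframe()\n",
"aldermen_df = aldermen.to_pandas_dataframe()\n",
"\n",
"crime_df.merge(aldermen_df, on='Ward').head(5)"
]
},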
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Advanced features:__ [Learn more](../../how-to-guides/join.ipynb) about how you can do all variants of `join`, like inner-, left-, right-, anti-, and semi-joins."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Filter\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter\n",
"\n",
"Let's look at theft crimes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theft = crime.filter(crime['Primary Type'] == 'THEFT')\n",
"\n",
"theft.head(5)"
]
},
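{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filter expressions can also be combined. Below is a sketch that intersects two conditions; the `f_and` helper and the numeric comparison on `Ward` are assumptions here, so verify both against the SDK reference:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: combine two filter expressions with the assumed\n",
"# dprep.f_and helper; the Ward comparison assumes a numeric column type.\n",
"theft_in_ward_42 = crime.filter(\n",
"    dprep.f_and(\n",
"        crime['Primary Type'] == 'THEFT',\n",
"        crime['Ward'] == 42\n",
"    )\n",
")\n",
"\n",
"theft_in_ward_42.head(5)"
]
},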
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Replace\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Replace\n",
"\n",
"Notice that our `theft` dataset has empty strings in column `Location`. Let's replace those with a missing value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theft_replaced = (theft\n",
" .replace_na(\n",
" columns=['Location'], \n",
" use_empty_string_as_na=True\n",
" )\n",
")\n",
"\n",
"theft_replaced.head(5)"
]
},
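{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the replacement worked, we can re-profile the data and check that `Location` now reports missing values instead of empty strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Re-profile to verify the empty strings in Location became missing values.\n",
"theft_replaced.get_profile()"
]
},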
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Advanced features:__ [Learn more](../../how-to-guides/replace-fill-error.ipynb) about more advanced `replace` and `fill` capabilities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Consume\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consume your cleaned dataset\n",
"\n",
"Azure ML Data Prep allows you to \"choose your own adventure\" once you're done wrangling. You can:\n",
"\n",
"1. Write to a pandas dataframe\n",
"2. Execute on Spark\n",
"3. Consume directly in Azure Machine Learning models\n",
"\n",
"In this quickstart guide, we'll show how you can export to a pandas dataframe.\n",
"\n",
"__Advanced features:__ \n",
"* One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out.\n",
"* You can directly consume your new DataFlow in model builders like Azure Machine Learning's [automated machine learning](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/dataprep/auto-ml-dataprep.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theft_replaced.to_pandas_dataframe()"
]
},
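{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have a Spark session available, the same Dataflow can be executed at scale instead. Below is a sketch that assumes the SDK's `to_spark_dataframe` method; it is commented out because it only runs inside a Spark environment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: execute the same dataflow on Spark instead of pandas.\n",
"# to_spark_dataframe() is assumed per the SDK reference and requires a\n",
"# running Spark session, so it is commented out for local execution.\n",
"# spark_df = theft_replaced.to_spark_dataframe()\n",
"# spark_df.show(5)"
]
},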
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Explore\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore advanced features\n",
"\n",
"Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:\n",
"\n",
"* [Cache your Dataflow to speed up your iterations](../../how-to-guides/cache.ipynb)\n",
"* [Add your custom Python transforms](../../how-to-guides/custom-python-transforms.ipynb)\n",
"* [Impute missing values](../../how-to-guides/impute-missing-values.ipynb)\n",
"* [Sample your data](../../how-to-guides/subsetting-sampling.ipynb)\n",
"* [Reference and link between Dataflows](../../how-to-guides/join.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/tutorials/getting-started/getting-started.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}