MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/append-columns-and-rows.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/append-columns-and-rows.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Append Columns and Rows\n",
        "\n",
        "Often the data we want does not come in a single dataset: they are coming from different locations, have features that are separated, or are simply not homogeneous. Unsurprisingly, we typically want to work with a single dataset at a time.\n",
        "\n",
        "Azure ML Data Prep allows the concatenation of two or more dataflows by means of column and row appends.\n",
        "\n",
        "We will demonstrate this by defining a single dataflow that will pull data from multiple datasets.\n",
        "\n",
        "## Table of Contents\n",
        "[append_columns(dataflows)](#append_columns)<br>\n",
        "[append_rows(dataflows)](#append_rows)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"append_columns\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `append_columns(dataflows)`\n",
        "We can append data width-wise, which will change some or all existing rows and potentially adding rows (based on an assumption that data in two datasets are aligned on row number).\n",
        "\n",
        "However we cannot do this if the reference dataflows have clashing schema with the target dataflow. Observe:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')\n",
        "dflow_chicago.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import ExecutionError\n",
        "try:\n",
        "    dflow_combined_by_column = dflow.append_columns([dflow_chicago])\n",
        "    dflow_combined_by_column.head(5)\n",
        "except ExecutionError:\n",
        "    print('Cannot append_columns with schema clash!')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As expected, we cannot call `append_columns` with target dataflows that have clashing schema.\n",
        "\n",
        "We can make the call once we rename or drop the offending columns. In more complex scenarios, we could opt to skip or filter to make rows align before appending columns. Here we will choose to simply drop the clashing column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_combined_by_column = dflow.append_columns([dflow_chicago.drop_columns(['Ward'])])\n",
        "dflow_combined_by_column.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice that the resultant schema has more columns in the first N records (N being the number of records in `dataflow` and the extra columns being the width of the schema of our reference dataflow, chicago, minus the `Ward` column). From the N+1th record onwards, we will only have a schema width matching that of the `Ward`-less chicago set.\n",
        "\n",
        "Why is this? As much as possible, the data from the reference dataflow(s) will be attached to existing rows in the target dataflow. If there are not enough rows in the target dataflow to attach to, we simply append them as new rows.\n",
        "\n",
        "Note that these are appends, not joins (for joins please reference [Join](join.ipynb)), so the append may not be logically correct, but will take effect as long as there are no schema clashes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Ward-less data after we skip the first N rows\n",
        "dflow_len = dflow.row_count\n",
        "dflow_combined_by_column.skip(dflow_len).head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"append_rows\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `append_rows(dataflows)`\n",
        "We can append data length-wise, which will only have the effect of adding new rows. No existing data will be changed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_spring = auto_read_file(path='../data/crime-spring.csv')\n",
        "dflow_spring.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')\n",
        "dflow_chicago.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_combined_by_row = dflow.append_rows([dflow_chicago, dflow_spring])\n",
        "dflow_combined_by_row.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Notice that neither schema nor data has changed for the target dataflow.\n",
        "\n",
        "If we skip ahead, we will see our target dataflows' data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# chicago data\n",
        "dflow_len = dflow.row_count\n",
        "dflow_combined_by_row.skip(dflow_len).head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# crimes spring data\n",
        "dflow_chicago_len = dflow_chicago.row_count\n",
        "dflow_combined_by_row.skip(dflow_len + dflow_chicago_len).head(5)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}