{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working With File Streams\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to loading and parsing tabular data (see [here](./data-ingestion.ipynb) for more details), Data Prep also supports a variety of operations on raw file streams.\n",
"\n",
"File streams are usually created by calling `Dataflow.get_files`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.Dataflow.get_files(path='../data/*.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of this operation is a Dataflow with a single column named \"Path\". This column contains values of type `StreamInfo`, each of which represents a different file matched by the search pattern specified when calling `get_files`. The string representation of a `StreamInfo` follows this pattern:\n",
"\n",
"StreamInfo(_Location_://_ResourceIdentifier_\\[_Arguments_\\])\n",
"\n",
"Location is the type of storage where the stream is located (e.g. Azure Blob, Local, or ADLS); ResourceIdentifier is the name of the file within that storage, such as a file path; and Arguments is a list of arguments required to load and read the file.\n",
"\n",
"On their own, `StreamInfo` objects are not particularly useful; however, you can use them as input to other functions."
]
},
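{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of this format, the following cell pulls a few records into memory with `head` and prints the string form of each `StreamInfo` in the \"Path\" column. This is just an inspection sketch; it assumes the CSV files matched above exist locally, so the streams should show the Local location type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the string representation of a few StreamInfo values.\n",
"# head() returns a pandas DataFrame, so the 'Path' column can be iterated directly.\n",
"sample = dflow.head(3)\n",
"for stream_info in sample['Path']:\n",
"    print(stream_info)"
]
},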
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving File Names\n",
"\n",
"In the example above, we matched a set of CSV files by using a search pattern and got back a column with several `StreamInfo` objects, each representing a different file. Now, we will extract the file path and name for each of these values into a new string column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.add_column(expression=dprep.get_stream_name(dflow['Path']),\n",
"                         new_column_name='FilePath',\n",
"                         prior_column='Path')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_stream_name` function will return the full name of the file referenced by a `StreamInfo`. In the case of a local file, this will be an absolute path. From here, you can use the `derive_column_by_example` method to extract just the file name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"first_file_path = dflow.head(1)['FilePath'][0]\n",
"first_file_name = os.path.basename(first_file_path)\n",
"dflow = dflow.derive_column_by_example(new_column_name='FileName',\n",
"                                       source_columns=['FilePath'],\n",
"                                       example_data=(first_file_path, first_file_name))\n",
"dflow = dflow.drop_columns(['FilePath'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing Streams\n",
"\n",
"Whenever you have a column containing `StreamInfo` objects, it's possible to write these out to any of the locations Data Prep supports. You can do this by calling `Dataflow.write_streams`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.write_streams(streams_column='Path', base_path=dprep.LocalFileOutput('./test_out/')).run_local()"
]
},
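{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what was produced, we can list the contents of the output folder with the standard library. This check assumes the `run_local` call above completed successfully and that `./test_out/` resolves against the notebook's working directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# List the files produced by the write_streams call above.\n",
"sorted(os.listdir('./test_out/'))"
]
},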
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `base_path` parameter specifies the location the files will be written to. By default, the name of the file will be the resource identifier of the stream with any invalid characters replaced by `_`. In the case of streams referencing local files, this would be the full path of the original file. You can also specify the desired file names by referencing a column containing them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.write_streams(streams_column='Path',\n",
"                    base_path=dprep.LocalFileOutput('./test_out/'),\n",
"                    file_names_column='FileName').run_local()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this functionality, you can transfer files from any source to any destination supported by Data Prep. In addition, since the streams are just values in the Dataflow, you can use all of the usual Dataflow functionality to filter and transform them before writing.\n",
"\n",
"Here, for example, we will write out only the files that start with the prefix \"crime-\". The resulting file names will have the prefix stripped and will be written to a folder named \"crime\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prefix = 'crime-'\n",
"dflow = dflow.filter(dflow['FileName'].starts_with(prefix))\n",
"dflow = dflow.add_column(expression=dflow['FileName'].substring(len(prefix)),\n",
"                         new_column_name='CleanName',\n",
"                         prior_column='FileName')\n",
"dflow.write_streams(streams_column='Path',\n",
"                    base_path=dprep.LocalFileOutput('./test_out/crime/'),\n",
"                    file_names_column='CleanName').run_local()"
]
},
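{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final check, the cell below lists the files written to the \"crime\" folder; their names should appear with the \"crime-\" prefix stripped. As before, this assumes the preceding `run_local` call succeeded and that the relative output path resolves against the notebook's working directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# The written files should carry the cleaned names (prefix stripped).\n",
"sorted(os.listdir('./test_out/crime/'))"
]
}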
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}