From 63d1d57dfb3b0ed1f5f6e7c8af5e8a1ae8be62e3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Shan=C3=A9=20Winner?= <43390034+swinner95@users.noreply.github.com>
Date: Sun, 28 Jul 2019 00:25:21 -0700
Subject: [PATCH] Delete random-split.ipynb

---
 .../dataprep/how-to-guides/random-split.ipynb | 146 ------------------
 1 file changed, 146 deletions(-)
 delete mode 100644 how-to-use-azureml/work-with-data/dataprep/how-to-guides/random-split.ipynb

diff --git a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/random-split.ipynb b/how-to-use-azureml/work-with-data/dataprep/how-to-guides/random-split.ipynb
deleted file mode 100644
index 26b53043..00000000
--- a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/random-split.ipynb
+++ /dev/null
@@ -1,146 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/random-split.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Random Split\n",
-    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
-    "Licensed under the MIT License."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import azureml.dataprep as dprep"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Azure ML Data Prep provides the functionality of splitting a data set into two. When training a machine learning model, it is often desirable to train the model on a subset of data, then validate the model on a different subset."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `random_split(percentage, seed=None)` function in Data Prep takes a Dataflow and randomly splits it into two distinct subsets (approximately by the percentage specified)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `random_split` will receive different seeds."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To demonstrate, you can go through the following example. First, you can read the first 10,000 lines from a file. Since the contents of the file don't matter, just the first two columns can be used for a simple example."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv').take(10000)\n",
-    "dflow = dflow.keep_columns(['ID', 'Date'])\n",
-    "profile = dflow.get_profile()\n",
-    "print('Row count: %d' % (profile.columns['ID'].count))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next, you can call `random_split` with the percentage set to 10% (the actual split ratio will be an approximation of `percentage`). You can take a look at the row count of the first returned Dataflow. You should see that `dflow_test` has approximately 1,000 rows (10% of 10,000)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "(dflow_test, dflow_train) = dflow.random_split(percentage=0.1)\n",
-    "profile_test = dflow_test.get_profile()\n",
-    "print('Row count of \"test\": %d' % (profile_test.columns['ID'].count))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now you can take a look at the row count of the second returned Dataflow. The row counts of `dflow_test` and `dflow_train` sum exactly to 10,000, because `random_split` results in two subsets that make up the original Dataflow."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile_train = dflow_train.get_profile()\n",
-    "print('Row count of \"train\": %d' % (profile_train.columns['ID'].count))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To specify a fixed seed, simply provide it to the `random_split` function."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "(dflow_test, dflow_train) = dflow.random_split(percentage=0.1, seed=12345)"
-   ]
-  }
- ],
- "metadata": {
-  "authors": [
-   {
-    "name": "sihhu"
-   }
-  ],
-  "kernelspec": {
-   "display_name": "Python 3.6",
-   "language": "python",
-   "name": "python36"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
\ No newline at end of file
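
The deleted guide attributes three properties to `random_split`: the split only approximates `percentage`, the two subsets together make up the original data, and a fixed `seed` gives a repeatable result. These can be illustrated without Data Prep. The following is a minimal pure-Python sketch, not the Data Prep implementation. The helper `random_split_rows` is hypothetical, and assigning each row independently with probability `percentage` is only an assumption about how an approximate split might be produced.

```python
import random


def random_split_rows(rows, percentage, seed=None):
    """Hypothetical stand-in for a seeded random split (not the Data Prep API).

    Each row goes to the first subset with probability `percentage`,
    so the split ratio is approximate rather than exact.
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(10000))
test, train = random_split_rows(rows, percentage=0.1, seed=12345)

# The "test" subset holds roughly 10% of the rows; the exact count varies with the seed.
print('Row count of "test": %d' % len(test))
print('Row count of "train": %d' % len(train))
# Every row lands in exactly one subset, so the counts always sum to 10,000.
print('Total: %d' % (len(test) + len(train)))
```

Calling the helper again with the same `seed` returns the same two subsets, which mirrors what the guide says about passing `seed=12345`.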