mirror of https://github.com/Azure/MachineLearningNotebooks.git (synced 2025-12-21 18:15:13 -05:00)

update samples from Release-132 as a part of 1.0.48 SDK release

work-with-data/dataprep/how-to-guides/random-split.ipynb (new normal file, 145 lines)
@@ -0,0 +1,145 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Split\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure ML Data Prep provides functionality for splitting a data set into two. When training a machine learning model, it is often desirable to train the model on one subset of the data and then validate it on a different subset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `random_split(percentage, seed=None)` function in Data Prep takes a Dataflow and randomly splits it into two distinct subsets, sized approximately according to the specified percentage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `random_split` will receive different seeds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To demonstrate, you can work through the following example. First, read the first 10,000 rows from a file. Since the contents of the file don't matter here, keep only the first two columns for a simple example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv').take(10000)\n",
"dflow = dflow.keep_columns(['ID', 'Date'])\n",
"profile = dflow.get_profile()\n",
"print('Row count: %d' % (profile.columns['ID'].count))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you can call `random_split` with the percentage set to 10% (the actual split ratio will be an approximation of `percentage`). You can take a look at the row count of the first returned Dataflow. You should see that `dflow_test` has approximately 1,000 rows (10% of 10,000)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(dflow_test, dflow_train) = dflow.random_split(percentage=0.1)\n",
"profile_test = dflow_test.get_profile()\n",
"print('Row count of \"test\": %d' % (profile_test.columns['ID'].count))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can take a look at the row count of the second returned Dataflow. The row counts of `dflow_test` and `dflow_train` sum exactly to 10,000, because `random_split` produces two subsets that together make up the original Dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile_train = dflow_train.get_profile()\n",
"print('Row count of \"train\": %d' % (profile_train.columns['ID'].count))"
]
},
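{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a small sketch, not part of the original walkthrough, assuming the `profile`, `profile_test`, and `profile_train` variables from the cells above are still available), you can verify that the two row counts add up to the original 10,000 rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: the two subsets together should account for every row of the original Dataflow.\n",
"total = profile_test.columns['ID'].count + profile_train.columns['ID'].count\n",
"print('Row count of \"test\" + \"train\": %d' % total)"
]
},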
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To specify a fixed seed, simply provide it to the `random_split` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(dflow_test, dflow_train) = dflow.random_split(percentage=0.1, seed=12345)"
]
},
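{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a short illustrative sketch (assuming the `dflow` variable defined earlier in this notebook), you can split the same Dataflow twice with the same fixed seed and confirm that both runs place the same number of rows in the test subset, which is what makes a fixed seed useful for reproducible experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: splitting twice with the same seed should yield the same partition sizes.\n",
"(test_a, _) = dflow.random_split(percentage=0.1, seed=12345)\n",
"(test_b, _) = dflow.random_split(percentage=0.1, seed=12345)\n",
"count_a = test_a.get_profile().columns['ID'].count\n",
"count_b = test_b.get_profile().columns['ID'].count\n",
"print('First split: %d rows, second split: %d rows' % (count_a, count_b))"
]
}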
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}