Update samples from Release-146 as part of the 1.0.62 SDK release

vizhur
2019-09-16 23:21:57 +00:00
parent e1724c8a89
commit 6bb1e2a3e3
96 changed files with 11640 additions and 2027 deletions


@@ -184,11 +184,10 @@
"\n",
"## Explore data\n",
"\n",
"Before you train a model, you need to understand the data that you are using to train it. You also need to copy the data into the cloud so it can be accessed by your cloud training environment. In this section you learn how to:\n",
"Before you train a model, you need to understand the data that you are using to train it. In this section you learn how to:\n",
"\n",
"* Download the MNIST dataset\n",
"* Display some sample images\n",
"* Upload data to the cloud\n",
"\n",
"### Download the MNIST dataset\n",
"\n",
@@ -254,13 +253,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you have an idea of what these images look like and the expected prediction outcome.\n",
"\n",
"### Upload data to the cloud\n",
"\n",
"Now make the data accessible remotely by uploading that data from your local machine into Azure so it can be accessed for remote training. The datastore is a convenient construct associated with your workspace for you to upload/download data, and interact with it from your remote compute targets. It is backed by Azure blob storage account.\n",
"\n",
"The MNIST files are uploaded into a directory named `mnist` at the root of the datastore. See [access data from your datastores](https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-access-data) for more information."
"## Create a FileDataset\n",
"A FileDataset references single or multiple files in your datastores or public urls. The files can be of any format. FileDataset provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. [Learn More](https://aka.ms/azureml/howto/createdatasets)"
]
},
{
@@ -273,10 +267,44 @@
},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)\n",
"from azureml.core.dataset import Dataset\n",
"\n",
"ds.upload(src_dir=data_folder, target_path='mnist', overwrite=True, show_progress=True)"
"web_paths = [\n",
" 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',\n",
" 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'\n",
" ]\n",
"dataset = Dataset.File.from_files(path = web_paths)"
]
},
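{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `FileDataset` can be downloaded or mounted on demand. The sketch below downloads the referenced files to a local folder so you can verify them before training; the folder name `mnist_local` is only an example and is not used later in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: download the files referenced by the dataset for local inspection\n",
"# the target path 'mnist_local' is an arbitrary example folder\n",
"downloaded_files = dataset.download(target_path='mnist_local', overwrite=True)\n",
"print(downloaded_files)"
]
},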
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the `register()` method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.register(workspace = ws,\n",
" name = 'mnist dataset',\n",
" description='training and test dataset',\n",
" create_new_version=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# list the files referenced by dataset\n",
"dataset.to_path()"
]
},
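{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once registered, the dataset can be retrieved by name in later sessions or from other notebooks. A minimal sketch, assuming the registration above succeeded:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# retrieve the registered dataset from the workspace by name\n",
"mnist_dataset = Dataset.get_by_name(workspace=ws, name='mnist dataset')\n",
"print(mnist_dataset.name, mnist_dataset.version)"
]
},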
{
@@ -327,6 +355,7 @@
"import argparse\n",
"import os\n",
"import numpy as np\n",
"import glob\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.externals import joblib\n",
@@ -334,7 +363,7 @@
"from azureml.core import Run\n",
"from utils import load_data\n",
"\n",
"# let user feed in 2 parameters, the location of the data files (from datastore), and the regularization rate of the logistic regression model\n",
"# let user feed in 2 parameters, the dataset to mount or download, and the regularization rate of the logistic regression model\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')\n",
"parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')\n",
@@ -345,10 +374,11 @@
"\n",
"# load train and test set into numpy arrays\n",
"# note we scale the pixel intensity values to 0-1 (by dividing it with 255.0) so the model can converge faster.\n",
"X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0\n",
"X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0\n",
"y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)\n",
"y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)\n",
"X_train = load_data(glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0\n",
"X_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0\n",
"y_train = load_data(glob.glob(os.path.join(data_folder, '**/train-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)\n",
"y_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)\n",
"\n",
"print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\\n')\n",
"\n",
"# get hold of the current run\n",
@@ -379,7 +409,7 @@
"source": [
"Notice how the script gets data and saves models:\n",
"\n",
"+ The training script reads an argument to find the directory containing the data. When you submit the job later, you point to the datastore for this argument:\n",
"+ The training script reads an argument to find the directory containing the data. When you submit the job later, you point to the dataset for this argument:\n",
"`parser.add_argument('--data-folder', type=str, dest='data_folder', help='data directory mounting point')`"
]
},
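{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the dataset is passed as a named input (`as_named_input('mnist')`, shown below), the mount path could also be resolved inside the script from the run context, for example `data_folder = run.input_datasets['mnist']`. That alternative is only an illustration; `train.py` in this tutorial uses the `--data-folder` argument."
]
},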
@@ -424,7 +454,23 @@
"* The training script name, train.py\n",
"* Parameters required from the training script \n",
"\n",
"In this tutorial, this target is AmlCompute. All files in the script folder are uploaded into the cluster nodes for execution. The data_folder is set to use the datastore (`ds.path('mnist').as_mount()`)."
"In this tutorial, the target is AmlCompute. All files in the script folder are uploaded into the cluster nodes for execution. The data_folder is set to use the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.environment import Environment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"# to install required packages\n",
"env = Environment('my_env')\n",
"cd = CondaDependencies.create(pip_packages=['azureml-sdk','scikit-learn','azureml-dataprep[pandas,fuse]>=1.1.14'])\n",
"\n",
"env.python.conda_dependencies = cd"
]
},
{
@@ -440,30 +486,16 @@
"from azureml.train.sklearn import SKLearn\n",
"\n",
"script_params = {\n",
" '--data-folder': ds.path('mnist').as_mount(),\n",
" # to mount files referenced by mnist dataset\n",
" '--data-folder': dataset.as_named_input('mnist').as_mount(),\n",
" '--regularization': 0.5\n",
"}\n",
"\n",
"est = SKLearn(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" entry_script='train.py')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what the mounting point looks like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ds.path('mnist').as_mount())"
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" environment_definition=env,\n",
" entry_script='train.py')"
]
},
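{
"cell_type": "markdown",
"metadata": {},
"source": [
"The estimator is then submitted to an experiment to run on the remote compute. A minimal sketch, assuming an `Experiment` object named `exp` was created earlier in the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# submit the estimator; `exp` is assumed to be an azureml.core.Experiment created earlier\n",
"run = exp.submit(config=est)\n",
"run.wait_for_completion(show_output=True)"
]
},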
{
@@ -684,7 +716,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
"version": "3.6.9"
},
"msauthor": "roastala"
},
