update samples - test

2019-11-01 14:48:01 +00:00
parent 46ec74f8df
commit 4ed3f0767a
308 changed files with 13971 additions and 59495 deletions
--- a/tutorials/img-classification-part1-training.ipynb
+++ b/tutorials/img-classification-part1-training.ipynb
@@ -17,7 +17,7 @@
        "\n",
        "In this tutorial, you train a machine learning model on remote compute resources. You'll use the training and deployment workflow for Azure Machine Learning service (preview) in a Python Jupyter notebook.  You can then use the notebook as a template to train your own machine learning model with your own data. This tutorial is **part one of a two-part tutorial series**.  \n",
        "\n",
-        "This tutorial trains a simple logistic regression using the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and [scikit-learn](http://scikit-learn.org) with Azure Machine Learning.  MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of 28x28 pixels, representing a number from 0 to 9. The goal is to create a multi-class classifier to identify the digit a given image represents. \n",
+        "This tutorial trains a simple logistic regression using the [MNIST](https://azure.microsoft.com/services/open-datasets/catalog/mnist/) dataset and [scikit-learn](http://scikit-learn.org) with Azure Machine Learning.  MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of 28x28 pixels, representing a number from 0 to 9. The goal is to create a multi-class classifier to identify the digit a given image represents. \n",
        "\n",
        "Learn how to:\n",
        "\n",
@@ -158,9 +158,9 @@
        "if compute_name in ws.compute_targets:\n",
        "    compute_target = ws.compute_targets[compute_name]\n",
        "    if compute_target and type(compute_target) is AmlCompute:\n",
-        "        print('found compute target. just use it. ' + compute_name)\n",
+        "        print(\"found compute target: \" + compute_name)\n",
        "else:\n",
-        "    print('creating a new compute target...')\n",
+        "    print(\"creating new compute target...\")\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,\n",
        "                                                                min_nodes = compute_min_nodes, \n",
        "                                                                max_nodes = compute_max_nodes)\n",
@@ -191,7 +191,11 @@
        "\n",
        "### Download the MNIST dataset\n",
        "\n",
-        "Download the MNIST dataset and save the files into a `data` directory locally.  Images and labels for both training and testing are downloaded."
+        "Use Azure Open Datasets to get the raw MNIST data files. [Azure Open Datasets](https://docs.microsoft.com/azure/open-datasets/overview-what-are-open-datasets) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Each dataset has a corresponding class, `MNIST` in this case, to retrieve the data in different ways.\n",
+        "\n",
+        "This code retrieves the data as a `FileDataset` object, which is a subclass of `Dataset`. A `FileDataset` references single or multiple files of any format in your datastores or public URLs. The class provides you with the ability to download or mount the files to your compute by creating a reference to the data source location. Additionally, you register the Dataset to your workspace for easy retrieval during training.\n",
+        "\n",
+        "Follow the [how-to](https://aka.ms/azureml/howto/createdatasets) to learn more about Datasets and their usage in the SDK."
      ]
    },
    {
@@ -200,15 +204,19 @@
      "metadata": {},
      "outputs": [],
      "source": [
-        "import urllib.request\n",
+        "from azureml.core import Dataset\n",
+        "from azureml.opendatasets import MNIST\n",
        "\n",
        "data_folder = os.path.join(os.getcwd(), 'data')\n",
        "os.makedirs(data_folder, exist_ok=True)\n",
        "\n",
-        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename=os.path.join(data_folder, 'train-images.gz'))\n",
-        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename=os.path.join(data_folder, 'train-labels.gz'))\n",
-        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename=os.path.join(data_folder, 'test-images.gz'))\n",
-        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename=os.path.join(data_folder, 'test-labels.gz'))"
+        "mnist_file_dataset = MNIST.get_file_dataset()\n",
+        "mnist_file_dataset.download(data_folder, overwrite=True)\n",
+        "\n",
+        "mnist_file_dataset = mnist_file_dataset.register(workspace=ws,\n",
+        "                                                 name='mnist_opendataset',\n",
+        "                                                 description='training and test dataset',\n",
+        "                                                 create_new_version=True)"
      ]
    },
    {
@@ -230,10 +238,10 @@
        "from utils import load_data\n",
        "\n",
        "# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.\n",
-        "X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0\n",
-        "X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0\n",
-        "y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)\n",
-        "y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)\n",
+        "X_train = load_data(os.path.join(data_folder, \"train-images-idx3-ubyte.gz\"), False) / 255.0\n",
+        "X_test = load_data(os.path.join(data_folder, \"t10k-images-idx3-ubyte.gz\"), False) / 255.0\n",
+        "y_train = load_data(os.path.join(data_folder, \"train-labels-idx1-ubyte.gz\"), True).reshape(-1)\n",
+        "y_test = load_data(os.path.join(data_folder, \"t10k-labels-idx1-ubyte.gz\"), True).reshape(-1)\n",
        "\n",
        "# now let's show some randomly chosen images from the traininng set.\n",
        "count = 0\n",
@@ -249,65 +257,6 @@
        "plt.show()"
      ]
    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Create a FileDataset\n",
-        "A FileDataset references single or multiple files in your datastores or public urls. The files can be of any format. FileDataset provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. [Learn More](https://aka.ms/azureml/howto/createdatasets)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "tags": [
-          "use datastore",
-          "dataset-remarks-file-sample"
-        ]
-      },
-      "outputs": [],
-      "source": [
-        "from azureml.core.dataset import Dataset\n",
-        "\n",
-        "web_paths = [\n",
-        "            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',\n",
-        "            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',\n",
-        "            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',\n",
-        "            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'\n",
-        "            ]\n",
-        "dataset = Dataset.File.from_files(path = web_paths)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "Use the `register()` method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {},
-      "outputs": [],
-      "source": [
-        "dataset = dataset.register(workspace = ws,\n",
-        "                           name = 'mnist dataset',\n",
-        "                           description='training and test dataset',\n",
-        "                           create_new_version=True)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {},
-      "outputs": [],
-      "source": [
-        "# list the files referenced by dataset\n",
-        "dataset.to_path()"
-      ]
-    },
    {
      "cell_type": "markdown",
      "metadata": {},
@@ -488,7 +437,7 @@
        "\n",
        "script_params = {\n",
        "    # to mount files referenced by mnist dataset\n",
-        "    '--data-folder': dataset.as_named_input('mnist').as_mount(),\n",
+        "    '--data-folder': mnist_file_dataset.as_named_input('mnist_opendataset').as_mount(),\n",
        "    '--regularization': 0.5\n",
        "}\n",
        "\n",
@@ -699,7 +648,7 @@
  "metadata": {
    "authors": [
      {
-        "name": "roastala"
+        "name": "maxluk"
      }
    ],
    "kernelspec": {
--- a/tutorials/img-classification-part1-training.yml
+++ b/tutorials/img-classification-part1-training.yml
@@ -6,3 +6,4 @@ dependencies:
  - matplotlib
  - sklearn
  - pandas
+  - azureml-opendatasets
--- a/tutorials/img-classification-part2-deploy.ipynb
+++ b/tutorials/img-classification-part2-deploy.ipynb
@@ -289,12 +289,12 @@
        "from sklearn.externals import joblib\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "\n",
-        "from azureml.core.model import Model\n",
-        "\n",
        "def init():\n",
        "    global model\n",
-        "    # retrieve the path to the model file using the model name\n",
-        "    model_path = Model.get_model_path('sklearn_mnist')\n",
+        "    # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
+        "    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)\n",
+        "    # For multiple models, it points to the folder containing all deployed models (./azureml-models)\n",
+        "    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_mnist_model.pkl')\n",
        "    model = joblib.load(model_path)\n",
        "\n",
        "def run(raw_data):\n",
@@ -598,7 +598,7 @@
  "metadata": {
    "authors": [
      {
-        "name": "roastala"
+        "name": "shipatel"
      }
    ],
    "kernelspec": {
--- a/tutorials/tutorial-1st-experiment-sdk-train.ipynb
+++ b/tutorials/tutorial-1st-experiment-sdk-train.ipynb
@@ -98,7 +98,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "For this tutorial, you use the diabetes data set, which is a pre-normalized data set included in scikit-learn. This data set uses features like age, gender, and BMI to predict diabetes disease progression. Load the data from the `load_diabetes()` static function, and split it into training and test sets using `train_test_split()`. This function segregates the data so the model has unseen data to use for testing following training."
+        "For this tutorial, you use the diabetes data set, which uses features like age, gender, and BMI to predict diabetes disease progression. Load the data from the Azure Open Datasets class, and split it into training and test sets using `train_test_split()`. This function segregates the data so the model has unseen data to use for testing following training."
      ]
    },
    {
@@ -107,11 +107,13 @@
      "metadata": {},
      "outputs": [],
      "source": [
-        "from sklearn.datasets import load_diabetes\n",
+        "from azureml.opendatasets import Diabetes\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
-        "X, y = load_diabetes(return_X_y = True)\n",
-        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=66)"
+        "x_df = Diabetes.get_tabular_dataset().to_pandas_dataframe().dropna()\n",
+        "y_df = x_df.pop(\"Y\")\n",
+        "\n",
+        "X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)"
      ]
    },
    {
--- a/tutorials/tutorial-1st-experiment-sdk-train.yml
+++ b/tutorials/tutorial-1st-experiment-sdk-train.yml
@@ -3,3 +3,4 @@ dependencies:
 - pip:
  - azureml-sdk
  - sklearn
+  - azureml-opendatasets