{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Many Models - Automated ML\n", "**_Generate many models time series forecasts with Automated Machine Learning_**\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook we are using a synthetic dataset portraying sales data to predict the quantity of a vartiety of product SKUs across several states, stores, and product categories.\n", "\n", "**NOTE: There are limits on how many runs we can do in parallel per workspace, and we currently recommend to set the parallelism to maximum of 320 runs per experiment per workspace. If users want to have more parallelism and increase this limit they might encounter Too Many Requests errors (HTTP 429).**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prerequisites\n", "You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.0 Set up workspace, datastore, experiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1613003526897 } }, "outputs": [], "source": [ "import azureml.core\n", "from azureml.core import Workspace, Datastore\n", "import pandas as pd\n", "\n", "# Set up your workspace\n", "ws = Workspace.from_config()\n", "ws.get_details()\n", "\n", "# Set up your datastores\n", "dstore = ws.get_default_datastore()\n", "\n", "output = {}\n", "output[\"SDK version\"] = azureml.core.VERSION\n", "output[\"Subscription ID\"] = ws.subscription_id\n", "output[\"Workspace\"] = ws.name\n", "output[\"Resource Group\"] = ws.resource_group\n", "output[\"Location\"] = ws.location\n", "output[\"Default datastore name\"] = dstore.name\n", "pd.set_option(\"display.max_colwidth\", -1)\n", "outputDf = pd.DataFrame(data=output, index=[\"\"])\n", "outputDf.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choose an experiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1613003540729 } }, "outputs": [], "source": [ "from azureml.core import Experiment\n", "\n", "experiment = Experiment(ws, \"automl-many-models\")\n", "\n", "print(\"Experiment name: \" + experiment.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.0 Data\n", "\n", "This notebook uses simulated orange juice sales data to walk you through the process of training many models on Azure Machine Learning using Automated ML. \n", "\n", "The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset which featured two years of sales of 3 different orange juice brands for individual stores. 
The full simulated dataset includes 3,991 stores with 3 orange juice brands each, allowing 11,973 models to be trained and showcasing the power of the many models pattern.\n", "\n", "In this notebook, two datasets will be created: one with all 11,973 files and one with only 10 files that can be used for quick testing and debugging. For each dataset, you'll be walked through the process of:\n", "\n", "1. Registering the blob container as a Datastore to the Workspace\n", "2. Registering a tabular dataset to the Workspace" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### 2.1 Data Preparation\n", "The OJ data is available in a public blob container. The data is split into a training set and an inferencing set: for the current dataset, the split was made on the time column ('WeekStarting'), with records before '1992-5-28' used for training and later records used for inferencing.\n", "\n", "The container has\n", "