update samples from Release-168 as a part of SDK release

2025-12-19 17:17:04 -05:00 · 2022-12-05 17:52:07 +00:00
parent 38d5743bbb
commit 4404e62f58
44 changed files with 187 additions and 814 deletions
--- a/how-to-use-azureml/automated-machine-learning/forecasting-many-models/auto-ml-forecasting-many-models.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/forecasting-many-models/auto-ml-forecasting-many-models.ipynb
@@ -379,8 +379,16 @@
      "source": [
        "### Set up training parameters\n",
        "\n",
-        "This dictionary defines the AutoML and many models settings. For this forecasting task we need to define several settings inncluding the name of the time column, the maximum forecast horizon, and the partition column name definition.\n",
+        "We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``ManyModelsTrainParameters`` objects. For the forecasting task we also need to define several settings, including the name of the time column, the maximum forecast horizon, and the partition column names.\n",
        "\n",
+        "#### ``ForecastingParameters`` arguments\n",
+        "| Property                           | Description|\n",
+        "| :---------------                   | :------------------- |\n",
+        "| **forecast_horizon**               | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
+        "| **time_column_name**               | The name of your time column. |\n",
+        "| **time_series_id_column_names**    | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
+        "\n",
+        "#### ``AutoMLConfig`` arguments\n",
        "| Property                           | Description|\n",
        "| :---------------                   | :------------------- |\n",
        "| **task**                           | forecasting |\n",
@@ -390,13 +398,10 @@
        "| **iterations**                     | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
        "| **experiment_timeout_hours**       | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
        "| **label_column_name**              | The name of the label column. |\n",
-        "| **forecast_horizon**               | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
-        "| **n_cross_validations**            | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
-        "|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
+        "| **n_cross_validations**            | Number of cross-validation splits. The default value is \\\"auto\\\", in which case AutoML determines the number of cross-validations automatically if a validation set is not provided. Alternatively, users can specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
+        "| **cv_step_size**                   | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoML determines the cross-validation step size automatically if a validation set is not provided. Alternatively, users can specify an integer value. |\n",
        "| **enable_early_stopping**          | Flag to enable early termination if the score is not improving in the short term. |\n",
-        "| **time_column_name**               | The name of your time column. |\n",
        "| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
-        "| **time_series_id_column_names**     | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
        "| **track_child_runs**               | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
        "| **pipeline_fetch_max_batch_size**  | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
        "| **partition_column_names**         | The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series. |"
@@ -415,23 +420,30 @@
        "from azureml.train.automl.runtime._many_models.many_models_parameters import (\n",
        "    ManyModelsTrainParameters,\n",
        ")\n",
+        "from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
+        "from azureml.train.automl.automlconfig import AutoMLConfig\n",
        "\n",
        "partition_column_names = [\"Store\", \"Brand\"]\n",
-        "automl_settings = {\n",
-        "    \"task\": \"forecasting\",\n",
-        "    \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
-        "    \"iteration_timeout_minutes\": 10,  # This needs to be changed based on the dataset. We ask customer to explore how long training is taking before settings this value\n",
-        "    \"iterations\": 15,\n",
-        "    \"experiment_timeout_hours\": 0.25,\n",
-        "    \"label_column_name\": \"Quantity\",\n",
-        "    \"n_cross_validations\": \"auto\",  # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
-        "    \"cv_step_size\": \"auto\",\n",
-        "    \"time_column_name\": \"WeekStarting\",\n",
-        "    \"drop_column_names\": \"Revenue\",\n",
-        "    \"forecast_horizon\": 6,\n",
-        "    \"time_series_id_column_names\": partition_column_names,\n",
-        "    \"track_child_runs\": False,\n",
-        "}\n",
+        "\n",
+        "forecasting_parameters = ForecastingParameters(\n",
+        "    time_column_name=\"WeekStarting\",\n",
+        "    drop_column_names=\"Revenue\",\n",
+        "    forecast_horizon=6,\n",
+        "    time_series_id_column_names=partition_column_names,\n",
+        ")\n",
+        "\n",
+        "automl_settings = AutoMLConfig(\n",
+        "    task=\"forecasting\",\n",
+        "    primary_metric=\"normalized_root_mean_squared_error\",\n",
+        "    iteration_timeout_minutes=10,\n",
+        "    iterations=15,\n",
+        "    experiment_timeout_hours=0.25,\n",
+        "    label_column_name=\"Quantity\",\n",
+        "    n_cross_validations=\"auto\",  # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
+        "    cv_step_size=\"auto\",\n",
+        "    track_child_runs=False,\n",
+        "    forecasting_parameters=forecasting_parameters,\n",
+        ")\n",
        "\n",
        "mm_paramters = ManyModelsTrainParameters(\n",
        "    automl_settings=automl_settings, partition_column_names=partition_column_names\n",
@@ -498,6 +510,7 @@
         "| **node_count**                     | The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing the node_count if training takes too long. |\n",
         "| **process_count_per_node**         | Process count per node. We recommend a 2:1 ratio of cores to processes per node, e.g. if a node has 16 cores, configure a process count per node of 8 or less for optimal performance. |\n",
        "| **train_pipeline_parameters**      | The set of configuration parameters defined in the previous section. |\n",
+        "| **run_invocation_timeout**         | Maximum amount of time in seconds that a single ``ParallelRunStep`` run invocation is allowed to take. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
        "\n",
        "Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
      ]
@@ -667,9 +680,9 @@
        "| :---------------                   | :------------------- |\n",
        "| **experiment**                     | The experiment used for inference run. |\n",
         "| **inference_data**                 | The data to use for inferencing. It should have the same schema as the training data. |\n",
-        "| **compute_target** The compute target that runs the inference pipeline.|\n",
+        "| **compute_target**                 | The compute target that runs the inference pipeline. |\n",
         "| **node_count**                     | The number of compute nodes to be used for running the user script. We recommend starting with the number of cores per node (varies by compute sku). |\n",
-        "| **process_count_per_node** The number of processes per node.\n",
+        "| **process_count_per_node**         | The number of processes per node (should be at most half of the number of cores of the compute cluster that will be used for the experiment). |\n",
         "| **train_run_id**                   | \\[Optional] The run id of the hierarchy training, by default it is the latest successful training many model run in the experiment. |\n",
         "| **train_experiment_name**          | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiment as the inference pipeline. |\n",
        "| **process_count_per_node**         | \\[Optional] The number of processes per node, by default it's 4. |"
@@ -692,6 +705,8 @@
        "    target_column_name=\"Quantity\",\n",
        ")\n",
        "\n",
+        "output_file_name = \"parallel_run_step.csv\"\n",
+        "\n",
        "inference_steps = AutoMLPipelineBuilder.get_many_models_batch_inference_steps(\n",
        "    experiment=experiment,\n",
        "    inference_data=inference_ds_small,\n",
@@ -703,6 +718,8 @@
        "    train_run_id=training_run.id,\n",
        "    train_experiment_name=training_run.experiment.name,\n",
        "    inference_pipeline_parameters=mm_parameters,\n",
+        "    append_row_file_name=output_file_name,\n",
+        "    arguments=[\"--forecast_quantiles\", 0.1, 0.9],\n",
        ")"
      ]
    },
@@ -737,7 +754,7 @@
        "\n",
        "The following code snippet:\n",
        "1. Downloads the contents of the output folder that is passed in the parallel run step \n",
-        "2. Reads the parallel_run_step.txt file that has the predictions as pandas dataframe and \n",
+        "2. Reads the output file containing the predictions into a pandas DataFrame and \n",
        "3. Displays the top 10 rows of the predictions"
      ]
    },
@@ -752,19 +769,9 @@
        "forecasting_results_name = \"forecasting_results\"\n",
        "forecasting_output_name = \"many_models_inference_output\"\n",
        "forecast_file = get_output_from_mm_pipeline(\n",
-        "    inference_run, forecasting_results_name, forecasting_output_name\n",
+        "    inference_run, forecasting_results_name, forecasting_output_name, output_file_name\n",
        ")\n",
-        "df = pd.read_csv(forecast_file, delimiter=\" \", header=None)\n",
-        "df.columns = [\n",
-        "    \"Week Starting\",\n",
-        "    \"Store\",\n",
-        "    \"Brand\",\n",
-        "    \"Quantity\",\n",
-        "    \"Advert\",\n",
-        "    \"Price\",\n",
-        "    \"Revenue\",\n",
-        "    \"Predicted\",\n",
-        "]\n",
+        "df = pd.read_csv(forecast_file)\n",
        "print(\n",
        "    \"Prediction has \", df.shape[0], \" rows. Here the first 10 rows are being displayed.\"\n",
        ")\n",