Training ML models with Azure ML SDK

These notebook tutorials cover the various scenarios for training machine learning and deep learning models with Azure Machine Learning.

Sample notebooks

  • 01.train-hyperparameter-tune-deploy-with-pytorch
    Train, hyperparameter tune, and deploy a PyTorch image classification model that distinguishes bees from ants using transfer learning (a minimal code sketch follows this list). Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Upload training data using Datastore
    • Run a single-node PyTorch training job
    • Hyperparameter tune model with HyperDrive
    • Find and register the best model
    • Deploy model to ACI
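
A minimal sketch of the compute-target and single-node training steps above, using the Azure ML Python SDK. The notebook provisions a Batch AI cluster; AmlCompute is the equivalent in later SDK releases, and the cluster name, VM size, experiment name, folder, and entry-script name below are illustrative rather than taken from the notebook:

```python
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.dnn import PyTorch

ws = Workspace.from_config()  # expects a config.json downloaded from the portal

# Remote GPU compute target (the notebook uses a Batch AI cluster;
# AmlCompute is the successor concept in later SDK versions).
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                       max_nodes=4)
compute_target = ComputeTarget.create(ws, 'gpu-cluster', compute_config)
compute_target.wait_for_completion(show_output=True)

# Single-node PyTorch training job on the remote cluster.
estimator = PyTorch(source_directory='./pytorch-hymenoptera',   # placeholder folder
                    compute_target=compute_target,
                    entry_script='pytorch_train.py',            # placeholder script name
                    script_params={'--num_epochs': 30},
                    use_gpu=True)

run = Experiment(ws, 'pytorch-transfer-learning').submit(estimator)
run.wait_for_completion(show_output=True)
```

Hyperparameter tuning with HyperDrive, model registration, and ACI deployment follow the same pattern; those steps are sketched under notebook 03 below.
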
  • 02.distributed-pytorch-with-horovod
    Train a PyTorch model on the MNIST dataset using distributed training with Horovod (a code sketch follows this list). Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Run a two-node distributed PyTorch training job using Horovod
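
A sketch of the distributed Horovod submission, assuming the compute target from the previous sketch already exists. The distributed_backend parameter is how estimators of this SDK generation request MPI (later releases use a distributed_training configuration object instead), and the folder, script, and experiment names are placeholders:

```python
from azureml.core import Workspace, Experiment
from azureml.train.dnn import PyTorch

ws = Workspace.from_config()

# Two-node Horovod job: one MPI process per node.
estimator = PyTorch(source_directory='./pytorch-horovod-mnist',     # placeholder folder
                    compute_target=ws.compute_targets['gpu-cluster'],
                    entry_script='pytorch_horovod_mnist.py',        # placeholder script name
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',   # Horovod is launched via MPI
                    use_gpu=True)

run = Experiment(ws, 'pytorch-horovod').submit(estimator)
run.wait_for_completion(show_output=True)
```
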
  • 03.train-hyperparameter-tune-deploy-with-tensorflow
    Train, hyperparameter tune, and deploy a TensorFlow model on the MNIST dataset (the tuning and registration steps are sketched in code after this list). Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Upload training data using Datastore
    • Run a single-node TensorFlow training job
    • Leverage features of the Run object
    • Download the trained model
    • Hyperparameter tune model with HyperDrive
    • Find and register the best model
    • Deploy model to ACI
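
A condensed sketch of the hyperparameter-tuning and model-registration steps, assuming a TensorFlow training script that logs a metric named validation_acc. The configuration class was called HyperDriveRunConfig in early SDK releases and HyperDriveConfig later; all names, paths, and search ranges below are illustrative:

```python
from azureml.core import Workspace, Experiment
from azureml.train.dnn import TensorFlow
from azureml.train.hyperdrive import (HyperDriveConfig, RandomParameterSampling,
                                      BanditPolicy, PrimaryMetricGoal,
                                      choice, loguniform)

ws = Workspace.from_config()

estimator = TensorFlow(source_directory='./tf-mnist',               # placeholder folder
                       compute_target=ws.compute_targets['gpu-cluster'],
                       entry_script='tf_mnist.py',                  # placeholder script name
                       use_gpu=True)

hd_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=RandomParameterSampling({
        '--learning-rate': loguniform(-6, -1),
        '--batch-size': choice(32, 64, 128)}),
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),
    primary_metric_name='validation_acc',   # must match what the script logs via run.log()
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=8,
    max_concurrent_runs=4)

hd_run = Experiment(ws, 'tf-mnist-hyperdrive').submit(hd_config)
hd_run.wait_for_completion(show_output=True)

# Pick the best child run and register its model for deployment.
best_run = hd_run.get_best_run_by_primary_metric()
model = best_run.register_model(model_name='tf-mnist', model_path='outputs/model')
```

Deployment to ACI then takes the registered model plus a scoring script and environment; see the notebook for the full web-service configuration.
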
  • 04.distributed-tensorflow-with-horovod
    Train a TensorFlow word2vec model using distributed training with Horovod (see the sketch after this list). Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Upload training data using Datastore
    • Run a two-node distributed TensorFlow training job using Horovod
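
A sketch of the datastore upload-and-mount pattern used to feed the distributed job; the training script receives the mounted location as a command-line argument. Paths, script, and experiment names are placeholders:

```python
from azureml.core import Workspace, Experiment
from azureml.train.dnn import TensorFlow

ws = Workspace.from_config()
ds = ws.get_default_datastore()

# Upload the training corpus to the workspace's default (blob) datastore.
ds.upload(src_dir='./data', target_path='word2vec', overwrite=True, show_progress=True)

# Two-node Horovod job that mounts the uploaded data onto the cluster nodes.
estimator = TensorFlow(source_directory='./tf-word2vec',            # placeholder folder
                       compute_target=ws.compute_targets['gpu-cluster'],
                       entry_script='word2vec_basic.py',            # placeholder script name
                       script_params={'--input_data': ds.path('word2vec').as_mount()},
                       node_count=2,
                       process_count_per_node=1,
                       distributed_backend='mpi',
                       use_gpu=True)

run = Experiment(ws, 'tf-word2vec-horovod').submit(estimator)
```
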
  • 05.distributed-tensorflow-with-parameter-server
    Train a TensorFlow model on the MNIST dataset using native distributed TensorFlow (parameter server); a code sketch follows this list. Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Run a distributed TensorFlow training job with two workers and one parameter server
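
A sketch of the parameter-server submission; in this SDK generation the role counts are expressed with the worker_count and parameter_server_count parameters together with distributed_backend='ps'. Folder, script, and experiment names are placeholders:

```python
from azureml.core import Workspace, Experiment
from azureml.train.dnn import TensorFlow

ws = Workspace.from_config()

# Native distributed TensorFlow: two workers plus one parameter server.
estimator = TensorFlow(source_directory='./tf-distr-ps',            # placeholder folder
                       compute_target=ws.compute_targets['gpu-cluster'],
                       entry_script='tf_mnist_replica.py',          # placeholder script name
                       worker_count=2,
                       parameter_server_count=1,
                       distributed_backend='ps',
                       use_gpu=True)

run = Experiment(ws, 'tf-parameter-server').submit(estimator)
```
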
  • 06.distributed-cntk-with-custom-docker
    Train a CNTK model on the MNIST dataset using the Azure ML base Estimator with a custom Docker image and distributed training (a code sketch follows this list). Azure ML concepts covered:

    • Create a remote compute target (Batch AI cluster)
    • Upload training data using Datastore
    • Run a base Estimator training job using a custom Docker image from Docker Hub
    • Run a two-node distributed CNTK training job via MPI using the base Estimator
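
A sketch of the base Estimator with a user-supplied Docker Hub image. The image reference is a placeholder, and the Docker-image parameter name has varied across SDK releases (custom_docker_base_image in early versions, custom_docker_image later):

```python
from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator

ws = Workspace.from_config()

# Two-node CNTK job run inside a custom Docker Hub image, launched via MPI.
estimator = Estimator(source_directory='./cntk-distr-mnist',                   # placeholder folder
                      compute_target=ws.compute_targets['gpu-cluster'],
                      entry_script='cntk_distr_mnist.py',                      # placeholder script name
                      node_count=2,
                      process_count_per_node=1,
                      distributed_backend='mpi',
                      custom_docker_image='<dockerhub-user>/cntk-mpi:latest',  # placeholder image
                      use_gpu=True)

run = Experiment(ws, 'cntk-distributed').submit(estimator)
```
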
  • 07.tensorboard
    Train a TensorFlow MNIST model locally, on a DSVM, and on a Batch AI cluster, and view the logs live in TensorBoard (a sketch of the TensorBoard integration follows this list). Azure ML concepts covered:

    • Run the training job locally with Azure ML, run TensorBoard locally, and start (and stop) an Azure ML Tensorboard object to stream and view the logs
    • Run the training job on a remote DSVM and stream the logs to TensorBoard
    • Run the training job on a remote Batch AI cluster and stream the logs to TensorBoard
    • Start a single Tensorboard instance that displays the logs from all three of the above runs in one combined view
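
A sketch of the TensorBoard integration: the azureml-tensorboard package provides a Tensorboard class that streams run logs to a local TensorBoard process. The experiment name is a placeholder, and `runs` stands in for the local, DSVM, and Batch AI runs submitted earlier in the notebook:

```python
from azureml.core import Workspace, Experiment
from azureml.tensorboard import Tensorboard

ws = Workspace.from_config()
exp = Experiment(ws, 'tensorboard-demo')        # placeholder experiment name

# Stand-in for the local, DSVM, and remote-cluster runs submitted earlier.
runs = list(exp.get_runs())

tb = Tensorboard(runs)      # a single instance can aggregate logs from several runs
print(tb.start())           # prints the local URL where TensorBoard is being served
# ... explore the dashboards, then shut the local server down ...
tb.stop()
```
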
  • 08.export-run-history-to-tensorboard
    Log metrics to Azure ML Run history, export them to TensorBoard logs, and view them in TensorBoard (a code sketch follows this list). Azure ML concepts covered:

    • Start an Azure ML Experiment and log metrics to Run history
    • Export the Run history logs to TensorBoard logs
    • View the logs in TensorBoard
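
A sketch of the export flow: log a few metrics to Run history, convert them to TensorBoard event files with export_to_tensorboard (from the azureml-tensorboard package), and point a local Tensorboard instance at the resulting log directory. The experiment name and metric values are illustrative:

```python
from azureml.core import Workspace, Experiment
from azureml.tensorboard import Tensorboard
from azureml.tensorboard.export import export_to_tensorboard

ws = Workspace.from_config()
exp = Experiment(ws, 'export-to-tensorboard')   # placeholder experiment name

# Log a handful of metrics to Run history.
run = exp.start_logging()
for loss in [0.9, 0.6, 0.4, 0.3]:               # illustrative values
    run.log('loss', loss)
run.complete()

# Convert the Run history records into TensorBoard event files, then view them.
logdir = './tb-logs'
export_to_tensorboard(run, logdir)

tb = Tensorboard([], local_root=logdir, port=6006)
print(tb.start())
tb.stop()
```
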