Introducing Checkpointless and Elastic Training on Amazon SageMaker HyperPod


Today, we’re announcing two new AI model training features within Amazon SageMaker HyperPod: checkpointless training, an approach that removes the need for traditional checkpoint-based recovery by enabling peer-to-peer state recovery, and elastic training, which automatically scales AI workloads based on resource availability.

  • Checkpointless training – Checkpointless training eliminates disruptive checkpoint-and-restart cycles, keeps training moving forward through failures, and cuts recovery time from hours to minutes. Accelerate your AI model development, reclaim days from development timelines, and confidently scale training workflows across thousands of AI accelerators.
  • Elastic training – Elastic training maximizes cluster utilization: training workloads automatically scale up to use idle capacity as it becomes available and contract to yield resources when higher-priority workloads, such as inference traffic, peak. Save the hours of engineering time each week otherwise spent reconfiguring training jobs around compute availability.

With these new training techniques, your team can spend less time managing training infrastructure and focus fully on improving model performance, bringing your AI models to market faster. By removing traditional checkpoint dependencies and taking full advantage of available capacity, you can significantly reduce the time it takes to complete model training.

Checkpointless training: How it works

Traditional checkpoint-based recovery moves through these sequential phases: 1) job termination and restart, 2) process discovery and network setup, 3) checkpoint loading, 4) data loader initialization, and 5) training loop recovery. When a failure occurs, each phase can become a bottleneck, and recovery can take up to an hour on self-managed training clusters. Because the entire cluster must wait for every phase to complete before training can continue, accelerators sit idle throughout recovery, increasing both cost and time to market.
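For context, a conventional PyTorch training loop persists and reloads state along these lines. This is a minimal sketch, assuming a shared checkpoint path and a toy model; the names and paths are illustrative, not HyperPod-specific, and phases 1, 2, and 5 above happen outside this script.

```python
import os
import torch

CKPT_PATH = "/fsx/checkpoints/latest.pt"  # illustrative shared-storage path

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

# Phase 3 of a traditional restart: reload the last checkpoint, if one exists.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
for step in range(start_step, 10_000):
    # ... forward pass, backward pass, optimizer.step() ...

    # Periodically persist the full training state so a restart can resume here;
    # every rank waits on this I/O, which is what checkpointless training avoids.
    if step % 500 == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```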

Checkpointless training removes this bottleneck entirely by maintaining continuous model state across the training cluster. When a failure occurs, the system recovers immediately from healthy peers, avoiding checkpoint-based recovery and the full job restart it requires. As a result, checkpointless training recovers from errors in minutes.
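Conceptually, peer-to-peer state recovery means a replaced worker pulls current weights directly from a healthy peer over the collective communication fabric instead of reading a checkpoint from storage. The sketch below illustrates that idea with plain torch.distributed broadcasts; it is not the HyperPod implementation, the function name is hypothetical, and optimizer, data loader, and rank re-admission handling are omitted.

```python
import torch
import torch.distributed as dist

def recover_from_healthy_peer(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Overwrite local model state with the state held by a healthy peer.

    Conceptual illustration only: every rank in the process group participates,
    and the tensors owned by src_rank (a healthy worker) win.
    """
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)
    for buffer in model.buffers():
        dist.broadcast(buffer, src=src_rank)

# Typical usage once the replacement worker has rejoined the process group:
# dist.init_process_group("nccl")
# recover_from_healthy_peer(model, src_rank=0)
```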

Checkpointless training is designed for incremental adoption and is built around four core components that work together: 1) optimized collective communication initialization, 2) memory-mapped data loading for caching, 3) in-process recovery, and 4) checkpointless peer-to-peer state replication. These components are orchestrated by the HyperPod training operator that runs the job. Each component optimizes a specific step in the recovery process, and together they automatically detect and recover from infrastructure failures in minutes without manual intervention, even across thousands of AI accelerators. You can enable each of these capabilities incrementally.
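As one illustration of the memory-mapped data loading component, a dataset backed by a memory-mapped file lets a recovering process reattach to already-cached training data instead of re-reading and re-processing it. This is a generic sketch assuming a flat file of uint16 token IDs; the class, file layout, and paths are assumptions, not the HyperPod format.

```python
import numpy as np
from torch.utils.data import Dataset

class MemmapTokenDataset(Dataset):
    """Serves fixed-length token sequences from a memory-mapped file.

    Because the operating system's page cache holds the mapped pages, a worker
    restarted on the same node can reopen the file and resume reading with warm
    caches rather than re-downloading and re-tokenizing the data.
    """

    def __init__(self, path: str, seq_len: int = 2048):
        self.seq_len = seq_len
        # mode="r" maps the file read-only without loading it into RAM up front.
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")

    def __len__(self) -> int:
        return len(self.tokens) // self.seq_len

    def __getitem__(self, idx: int) -> np.ndarray:
        start = idx * self.seq_len
        return np.asarray(self.tokens[start : start + self.seq_len], dtype=np.int64)
```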

The latest Amazon Nova models have been trained using this technology on tens of thousands of accelerators. Additionally, based on internal studies on cluster sizes ranging from 16 GPUs to over 2,000 GPUs, checkpointless training has shown significant recovery time improvements and reduced downtime by more than 80% compared to traditional checkpoint-based recovery.

To learn more, visit the checkpointless training implementation GitHub page and HyperPod Checkpointless Training in the Amazon SageMaker AI Developer Guide.

Elastic training: How it works

On clusters running a mix of modern AI workloads, accelerator availability changes continuously throughout the day as short-lived training jobs complete, inference traffic spikes, and resources are released from finished experiments. Despite this dynamic availability of AI accelerators, traditional training workloads remain locked into their initial allocation and cannot use idle accelerators without manual intervention. This rigidity leaves valuable GPU capacity unused and prevents organizations from maximizing their infrastructure investment.

Elastic training changes the way training workloads interact with cluster resources. Training workloads can automatically scale to take advantage of available accelerators and seamlessly scale back when resources are needed elsewhere, all while maintaining training quality.

Workload elasticity is enabled through the HyperPod training operator, which orchestrates scaling decisions by integrating with the Kubernetes control plane and resource scheduler. It continuously monitors cluster health through three primary channels: pod lifecycle events, node availability changes, and resource scheduler priority signals. This comprehensive monitoring enables near-instant detection of scaling opportunities, whether from newly available resources or from the demands of higher-priority workloads.
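To make the first of those channels concrete, the sketch below watches pod lifecycle events with the standard Kubernetes Python client. It is a simplified, hypothetical illustration rather than the HyperPod training operator's code; the namespace and the print-only reaction are assumptions, and node availability and scheduler priority signals are not shown.

```python
from kubernetes import client, config, watch

def watch_training_pods(namespace: str = "training") -> None:
    """Print pod lifecycle events that a scaling controller might react to."""
    config.load_incluster_config()  # use config.load_kube_config() outside a cluster
    core = client.CoreV1Api()
    for event in watch.Watch().stream(core.list_namespaced_pod, namespace=namespace):
        pod = event["object"]
        # A real controller would treat ADDED/DELETED events and Failed phases
        # as triggers to reconcile the number of data parallel replicas.
        print(event["type"], pod.metadata.name, pod.status.phase)

if __name__ == "__main__":
    watch_training_pods()
```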

The scaling mechanism relies on adding and removing data parallel replicas. As additional compute becomes available, new data parallel replicas are attached to the training job, increasing throughput. Conversely, during scale-down events (such as when a higher-priority workload requires resources), the system removes replicas instead of terminating the entire job, allowing training to continue with reduced capacity.

Across scaling events, the system preserves the global batch size and adapts the learning rate to avoid adversely affecting model convergence. This allows workloads to dynamically scale up or down and take advantage of available AI accelerators without any manual intervention.
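A minimal sketch of the batch-size arithmetic, assuming gradient accumulation is the knob that keeps the global batch fixed (the function name and divisibility policy are assumptions, and HyperPod's learning-rate adjustment is not modeled here): when the replica count changes, the accumulation steps are recomputed so that micro-batch size × replicas × accumulation stays constant.

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int,
                     num_replicas: int) -> int:
    """Return the accumulation steps that preserve the global batch size.

    Invariant: global_batch_size == micro_batch_size * num_replicas * steps,
    before and after every scale-up or scale-down event.
    """
    samples_per_optimizer_step = micro_batch_size * num_replicas
    if global_batch_size % samples_per_optimizer_step != 0:
        raise ValueError("replica count must divide the global batch evenly")
    return global_batch_size // samples_per_optimizer_step

# Example: a fixed global batch of 2,048 samples with micro-batches of 4 per GPU.
print(grad_accum_steps(2048, 4, 64))   # 8 accumulation steps at 64 replicas
print(grad_accum_steps(2048, 4, 128))  # 4 accumulation steps after scaling to 128
```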

You can start elastic training with HyperPod recipes for publicly available foundation models (FMs) including Llama and GPT-OSS. You can also modify your PyTorch training scripts to add elastic event handlers that enable dynamic workload scaling.

To learn more, visit HyperPod Elastic Training in the Amazon SageMaker AI Developer Guide. To get started, find the HyperPod recipes available in the AWS GitHub repository.

Now available

Both features are available in all AWS Regions where Amazon SageMaker HyperPod is available. You can use these training techniques at no additional cost. To learn more, visit the SageMaker HyperPod product page and the SageMaker AI pricing page.

Give it a try and submit feedback to AWS re:Post for SageMaker or through your usual AWS support contacts.

— Channy
