Diving into 3D Parallelism with Heterogeneous Spot Instance GPUs: Design and Implications

Yuxiao Wang, Yuedong Xu, Qingyang Duan, Yuxuan Liu, Lei Jiao, Yinghao Yu, Jun Wu

Yuedong Xu is with the College of Computer Science and Artificial Intelligence, and the Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China (e-mail: [email protected])

Abstract

The rapid growth of large language models (LLMs) and the continuous release of new GPU products have significantly increased the demand for distributed training across heterogeneous GPU environments. In this paper, we present a comprehensive analysis of the challenges involved in implementing 3D parallelism in such environments, addressing critical issues such as the need for symmetric tensor parallelism, efficient gradient synchronization in asymmetric pipeline parallelism, and the trade-offs between memory utilization and computational efficiency. Building upon these insights, we introduce AutoHet, a novel system that automatically identifies the optimal parallelism plan for distributed training on heterogeneous GPUs. AutoHet supports asymmetric 3D parallelism structures and facilitates fine-grained workload distribution. We propose a theoretical model that frames device grouping and load balancing as an optimization problem that minimizes per-iteration training time, thus effectively balancing computing power and memory usage across GPUs with diverse capabilities. To enable elastic training upon spot instance preemption, AutoHet adopts an efficient recovery strategy that prioritizes retrieving training states from local nodes and downloads only the missing checkpoints from cloud storage. Our extensive evaluation, conducted on three large-scale models with combinations of three different GPU types, demonstrates that AutoHet outperforms existing DNN training systems, achieving up to a 1.79$\times$ speedup in training throughput compared with Megatron-LM and Whale, and a 4.38$\times$ speedup in recovery compared to a spot instance baseline.

I Introduction

The “arms race” in large language models (LLMs) such as GPT-3 [gpt3], Gemini [team2024gemini], and LLaMA [touvron2023llama] has driven the rapid advancement of GPUs, with their computing power and storage capacity doubling every two to three years [nvidia_blackwell_architecture, nvidia_cudagpus]. Due to the relatively long lifespan of GPUs, different types of GPU machines are often mixed within a computing cluster [hetecluster, weng2022mlaas, jiang2017heterogeneity]. It is foreseeable that the GPU types in future clusters will be even more diversified. An intriguing phenomenon in today’s production clusters is the fluctuation in GPU availability, especially in spot instance scenarios [amazonec2, microsoftazure], where resources are not always abundant. The high demand for training further exacerbates supply shortages, leading to unacceptable delays for users seeking homogeneous GPUs. Figure 1 illustrates the variation in allocable GPU availability over three days in our cluster. At a given snapshot, homogeneous GPUs may be insufficient for large-scale model training, highlighting the need for advanced frameworks that can efficiently utilize diverse GPU resources to optimize training performance.

Figure 1: Allocable GPU spot instances over time.

To effectively pre-train LLMs, existing training frameworks [2019megatron, 2021megatron, 2020deepspeed] have made significant advances in optimizing 3D parallelism, which combines data parallelism (DP) [pytorch, rajbhandari2020zero], tensor parallelism (TP) [2019megatron, 2021tp, shazeer2018mesh], and pipeline parallelism (PP) [pipedream2bw, narayanan2019pipedream, huang2019gpipe] to distribute workloads across GPU clusters. While 3D parallelism performs well in homogeneous environments, it typically assumes uniformity in GPU resources. This assumption facilitates the coordination of parallel training strategies but becomes a critical limitation in heterogeneous environments. Recent efforts have been devoted to enhancing the performance of heterogeneous GPU training [chen2020semi, ding2021hetseq, duan2022hph, jia2022whale, park2020hetpipe, song2020accpar, zhou2023abs], but most of them are constrained to specific parallelism dimensions, such as adjusting the batch size within DP [zhou2023abs], only TP [song2020accpar], or only DP and PP [duan2022hph, park2020hetpipe]. Whale [jia2022whale] introduces a hardware-aware load balancing algorithm designed to distribute workloads within intra-parallelism. SDPipe [miao2023sdpipe] proposes a semi-decentralized training system specialized for pipeline-parallel training in dynamically heterogeneous environments. HPH [duan2022hph] and HetPipe [park2020hetpipe] concentrate on layer load balancing under DP and PP, excluding TP from their scope. Metis [um2024metis] presents an efficient algorithm that prunes the large search space and balances loads with heterogeneity awareness. This algorithm builds on the observations that the sizes of different PP stages do not vary significantly and that increasing DP is more beneficial than increasing TP.

In this paper, we aim to address three main challenges that existing training systems do not effectively tackle. First, the current 3D parallelism structures and methodologies are overly simplistic in system design and implementation, often leading to sub-optimal parallel strategies. Existing training systems [2019megatron, 2020deepspeed, jia2022whale, 2021megatron] predominantly rely on symmetric GPU distribution and parallel structures, where each parallel group is required to exhibit the same degree of parallelism. This symmetry constraint severely limits the exploration space for optimal parallel strategies in environments with diversified GPU computational and memory capacities. Second, the asymmetry needed to lift the first limitation significantly complicates load balancing. Without efficient load balancing, more powerful GPUs may remain underutilized while less capable GPUs become bottlenecks, leading to sub-optimal overall training performance. Third, rapid training recovery in the event of GPU spot instance preemption is rarely studied, even though the dynamic availability of different GPU types is a key reason for adopting heterogeneous GPUs in training.

To overcome these challenges, we present AutoHet, an automated 3D parallel training system designed for heterogeneous GPU environments. First, we perform a few simple but enlightening experiments to reveal the important properties of training with heterogeneous GPUs. In particular, TP cannot be asymmetric because of its high matrix transpose overhead before All-Reduce communication, while PP can be asymmetric across different DP groups. These observations allow us to eliminate arbitrary combinations of TP and PP in the subsequent planning of model training. Second, we present a two-stage decomposition approach to unravel the complexity of 3D parallel training with heterogeneous GPU types and communication links. At stage one, we formulate a nonlinear mixed-integer programming problem for planning 3D parallelism by maximizing the effective computing power, and obtain the optimal assignment of GPUs in DP groups. At stage two, we map each GPU to a certain PP stage of a DP group in which TP only operates on the GPUs connected via NVLinks, and perform the model partitioning on all the PP stages for computational load balancing. Third, we design an efficient migration strategy to resume training under varying parallelization plans. It prioritizes retrieving checkpoints locally and fetches only the missing ones from cloud storage, significantly reducing recovery time.

We evaluate AutoHet’s training efficiency across three GPU types (A100, H800, and H20) and three model architectures (BERT-Large [2019-bert], LLaMA, and GPT-3) on a platform equipped with 24 GPUs in total. AutoHet achieves 1.38 $\times$ speedup over Megatron-LM for BERT-Large, 1.53/1.27 $\times$ over Megatron-LM/Whale for GPT-3, and up to 1.79/1.51 $\times$ under non-uniform GPU distributions for LLaMA. In scalability analysis with simulated configurations up to 64 GPUs, planning overhead ranges from 1.23-159.12 seconds, with profiling time of 11.9-15.4 minutes, nearly ten times faster than Alpa. AutoHet’s migration strategy delivers up to 4.38 $\times$ faster recovery compared to Varuna [varuna] through optimized checkpoint management.

II Background and Observations

II-A DNN Training Parallelism

Existing LLM parallel training techniques typically include data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP).

1) Data parallelism involves distributing a partitioned dataset across various training nodes [dean2012large]. Each node possesses a full replica of the model, conducts training on its segment of data, and subsequently contributes to gradient exchange and parameter synchronization so as to construct a global model. This iterative process persists until the model attains convergence.

2) Tensor parallelism refers to a method where a neural network tensor is segmented into numerous equal-sized blocks along a specific dimension [2019megatron, 2021megatron]. Consequently, the operations on these tensor fragments are executed in parallel across multiple nodes, enhancing computational efficiency by distributing the workload.

3) Pipeline parallelism refers to an approach wherein a model is segmented into various stages by layers, and each segment is allocated to a different computational node [huang2019gpipe, narayanan2019pipedream, pipedream2bw]. The intermediate result of a preceding node is transmitted to its succeeding node as the input.

4) 3D parallelism. With the ever-increasing model size, using only one type of parallelization is inefficient. The aforementioned techniques are therefore usually combined into what we call 3D parallelism. Considerable effort has been devoted to orchestrating 3D parallelism to improve the utilization of GPU streaming cores [2019megatron, 2021megatron, bekman2022bloom, 2020deepspeed].

In what follows, we conduct an in-depth exploration of multidimensional parallelism with heterogeneous GPUs, presenting key insights toward the design of automatic parallelism.

II-B Symmetric Tensor Parallelism with Hetero-GPUs

Observation 1: Tensor parallelism partitioning needs to be symmetric across different DP chains.

We consider a hybrid TP and DP setting with one NVIDIA H800 GPU and two collocated A100 GPUs. Given the discrepancy of computing power in A100 and H800, and the high bandwidth NVLink interconnecting two A100 GPUs, it is reasonable to align two A100 GPUs for TP, and then let this TP group work with H800 GPU for DP (as shown in Figure 2). To synchronize the gradients inside this DP group, an AllReduce operation is executed in each iteration.

The core of existing deep learning frameworks, e.g., [2019megatron, 2020deepspeed], is the general matrix-to-matrix multiplication (GEMM). When a transformer block is partitioned across two GPUs, the GEMM process changes accordingly. Take as an example a two-layer MLP that involves two parameter matrices $A$ and $B$. We segment $A$ vertically into $A_{1}$ and $A_{2}$, and segment $B$ horizontally into $B_{1}$ and $B_{2}$. The gradient computed on each GPU takes the same form as the parameter. Denote by $g_{A}$ and $g_{B}$ the gradients of the parameter matrices $A$ and $B$, respectively. Here, $g_{A_{1}}$ and $g_{B_{1}}$ are stored as two vectors $[1\;3]$ and $[5\;6]$ on the first A100 GPU, and $g_{A_{2}}$ and $g_{B_{2}}$ are stored as two vectors $[2\;4]$ and $[7\;8]$ on the second A100 GPU. On the H800 GPU, in contrast, $g_{A}$ and $g_{B}$ are stored as two vectors $[1\;2\;3\;4]$ and $[5\;6\;7\;8]$. When AllReduce is called, $g_{B_{1}}$ and $g_{B_{2}}$ from the two A100 GPUs can be directly aggregated with $g_{B}$ from the H800 GPU. However, $g_{A}$ must be transposed to match $g_{A_{1}}$ and $g_{A_{2}}$ for gradient aggregation, which incurs a high computational cost and requires additional temporary storage space.
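The layout mismatch described above can be reproduced with the toy numbers from the text. The following is a minimal sketch in plain Python, not Megatron-LM code; the helper names (`flatten`, `split_columns`, `split_rows`, `transpose`) are hypothetical, and the gradients are assumed to be 2x2 matrices stored in row-major order.

```python
# Toy gradients from the text, viewed as 2x2 matrices (an assumption of this sketch).
g_A = [[1, 2], [3, 4]]
g_B = [[5, 6], [7, 8]]

def flatten(matrix):
    """Row-major flattening, i.e. how an unsplit gradient buffer is laid out."""
    return [v for row in matrix for v in row]

def split_columns(matrix, parts):
    """Vertical split: each shard keeps a contiguous block of columns."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]

def split_rows(matrix, parts):
    """Horizontal split: each shard keeps a contiguous block of rows."""
    height = len(matrix) // parts
    return [matrix[p * height:(p + 1) * height] for p in range(parts)]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Buffers held by the two A100 GPUs (TP = 2).
a_shards = [flatten(s) for s in split_columns(g_A, 2)]  # [[1, 3], [2, 4]]
b_shards = [flatten(s) for s in split_rows(g_B, 2)]     # [[5, 6], [7, 8]]
tp_view_A = a_shards[0] + a_shards[1]
tp_view_B = b_shards[0] + b_shards[1]

# Buffers held by the single H800 GPU (TP = 1).
full_A, full_B = flatten(g_A), flatten(g_B)

b_aligned = tp_view_B == full_B   # the row split lines up element by element
a_aligned = tp_view_A == full_A   # the column split does not
a_aligned_after_transpose = tp_view_A == flatten(transpose(g_A))
```

In this sketch the concatenated shards of $g_{B}$ match the H800 buffer directly, whereas the shards of $g_{A}$ only match after a transpose, which is the extra step the paragraph identifies as the source of overhead.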

We modify Megatron-LM in order to support the above-mentioned asymmetric TP. Figure 3 shows the performance overhead of introducing asymmetric TP for different model sizes. Asymmetric setups are created by only adding GPUs to symmetric configurations for asymmetric TP, while other settings are unaltered so as to ensure the identical throughput when such transpose overhead does not exist. The degradation of training throughput ranges from 8% to 49% for the models up to 10B parameters and becomes even worse with larger models. Without changing the existing design of TP, we conclude that the tensor parallelism must be symmetric across different DP chains.

II-C Asymmetric Pipeline Parallelism with Hetero-GPUs

Observation 2: Asymmetric pipeline parallelism demands layer-wise gradient synchronization.

When heterogeneous GPUs are assigned to multiple pipelines, the term “pipeline stage” becomes inconsistent in different data parallel groups. In our scenario, two A100 GPUs are concatenated in a pipeline that works with an H800 GPU for data parallelism. The LLM to be trained consists of four transformer layers, where each A100 GPU stores two consecutive layers, and the H800 GPU stores the entire model, as shown in Figure 4. After backpropagation is completed, an AllReduce operation is performed to synchronize the gradients across the data parallel groups.

In the frameworks optimized for homogeneous GPU training [2019megatron, 2021megatron, 2020deepspeed], all GPUs within the same stage synchronize their gradients, typically communicating in a ring topology where each GPU exchanges data with a single neighboring GPU. In our scenario, the first DP group contains two stages, whereas the second one contains a single stage. The H800 GPU transmits layer-4 and layer-3 gradients to the second-stage A100 GPU, and layer-2 and layer-1 gradients to the first-stage A100 GPU. Both A100 GPUs also transmit all their gradients to the H800 GPU. In light of this misalignment of pipeline stages among different DP groups, the AllReduce primitive must be adapted to accommodate this form of hybrid parallelism. Intuitively, the Ring AllReduce topology bifurcates if the gradient of a GPU is taken as a unit of data. When the number of DP groups increases, the AllReduce topology becomes extraordinarily complicated. One possible solution is to execute the AllReduce process at the granularity of individual layers, with each layer utilizing a different ring.
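To make the layer-granularity idea concrete, the sketch below derives, for each layer, the set of GPUs that hold a replica of it and would therefore form that layer's ring. It is an illustration only: the function name `layerwise_sync_groups`, the GPU labels, and the input format are assumptions of this sketch, and building the actual communication groups is left out.

```python
def layerwise_sync_groups(dp_groups):
    """Map each layer to the GPUs (one per DP group) that hold a replica of it.

    dp_groups: list of pipelines; each pipeline is a list of (gpu_name, layers).
    """
    groups = {}
    for pipeline in dp_groups:
        for gpu, layers in pipeline:
            for layer in layers:
                groups.setdefault(layer, []).append(gpu)
    return dict(sorted(groups.items()))

# The scenario in the text: a two-stage A100 pipeline and a single H800 GPU,
# training a model with four transformer layers.
dp_groups = [
    [("A100-0", [1, 2]), ("A100-1", [3, 4])],
    [("H800-0", [1, 2, 3, 4])],
]
rings = layerwise_sync_groups(dp_groups)
# rings[1] == ["A100-0", "H800-0"], rings[4] == ["A100-1", "H800-0"]
```

Each layer ends up with exactly one participant per DP group, so every per-layer ring stays regular even though the pipeline stages of the two DP groups are misaligned.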

II-D Complicated Load Balancing

Observation 3: A dilemma exists between GPU memory utilization and training efficiency on heterogeneous GPUs.

In homogeneous training, TP and PP are devised to overcome the GPU memory limit, and fully utilizing GPU memory is almost equivalent to maximizing the GPU computing power. However, this tenet does not hold in heterogeneous GPU training. Let us use a toy example to illustrate memory filling in a pipeline consisting of two A100 and two H800 GPUs. Recall that the actual computing power of an H800 is twice that of an A100 in our setting. Consider two memory filling methods: equal model partitioning and proportional model partitioning. In the former, the computing load of each GPU is nearly identical, but the durations of their forward and backward passes differ considerably. The ratio of overall idle GPU time amounts to $75\%$. In fact, the wasted computing power of an A100 GPU amounts to $\frac{2}{3}$, and that of an H800 GPU amounts to $\frac{5}{6}$, which means that equally partitioning the model is computationally inefficient. In the latter, the DNN layers are assigned to different GPUs in proportion to their relative computing power, and the fastest GPUs are filled with as many layers as possible. In this situation, a lower-end GPU (e.g., A100) is assigned fewer layers, but its memory size is usually more than half that of a higher-end one (e.g., H800), so its memory is underutilized. It can be readily concluded that balancing computational workloads across heterogeneous GPUs accelerates model training. However, it remains essential to consider the utilization of available memory resources.

III AutoHet 3D Parallelism

III-A Design Rationale

We begin with a general cost model of 3D parallel training in which the widely used 1F1B pipeline scheduler [narayanan2019pipedream] is adopted. Denote by $T^{\ast}$ the per-iteration training time:

$\displaystyle T^{\ast}=\min\left(\max_{j\in D}\left\{\sum_{i=1}^{P}t_{i}^{j}+(K-1)\cdot\max_{c\in S}t_{c}^{j}\right\}+T_{sync}\right)$	(1)

where $P$ denotes the total number of stages, $K$ is the number of micro-batches, $t_{i}^{j}$ represents the computation time for the forward pass and backward pass of the $i$ -th stage within the $j$ -th data parallel group (including TP and PP communications), and $T_{sync}$ is the time required for gradient synchronization, calculated by dividing the data volume for model synchronization by the lowest communication bandwidth among the data parallel groups.
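As an illustration of how Equation (1) is evaluated for one fixed plan, the sketch below computes the inner expression, leaving the outer minimization to a search over candidate plans. The function names and the example numbers are hypothetical, and the inner maximum is interpreted here as ranging over the stages of the same DP group, which is an assumption of this sketch.

```python
def iteration_time(stage_times, num_microbatches, t_sync):
    """Inner expression of Eq. (1) for one fixed plan.

    stage_times[j][i] is t_i^j, the forward-plus-backward time of stage i in
    DP group j. Groups may have different numbers of stages.
    """
    pipeline_times = [
        sum(stages) + (num_microbatches - 1) * max(stages)
        for stages in stage_times
    ]
    return max(pipeline_times) + t_sync

def sync_time(sync_bytes, bandwidths_bytes_per_s):
    """T_sync: synchronized data volume divided by the slowest DP link."""
    return sync_bytes / min(bandwidths_bytes_per_s)

# Two DP groups: a two-stage pipeline and a single-stage one (arbitrary time units).
t_plan = iteration_time([[1.0, 1.0], [2.0]], num_microbatches=4, t_sync=0.5)  # 8.5

# The outer "min" of Eq. (1) picks the cheapest of several candidate plans.
candidates = {
    "plan_a": [[1.0, 1.0], [2.0]],
    "plan_b": [[1.5, 1.5], [1.5, 1.5]],
}
best_plan = min(candidates, key=lambda name: iteration_time(candidates[name], 4, 0.5))
```

In this toy setting the slower single-stage group dominates `plan_a` (8.0 time units before synchronization), so the more balanced `plan_b` (7.5 time units) is selected.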

This cost model abstracts away the underlying complexity of 3D parallelism with heterogeneous GPUs and diverse communication links. When all the GPUs are homogeneous, we only need to specify the number of GPUs assigned to TP, PP and DP groups, and the model partitioning is symmetric. However, in a heterogeneous scenario, the optimal parallelization strategy involves selecting appropriate DP, TP, and PP groupings for each GPU while taking into account workload distribution, GPU resource constraints, and host server associations, which results in significant complexity. To address this challenge, we decompose the problem into two stages: i) computing power maximization and ii) GPU mapping and model partitioning.

The workflow of AutoHet is illustrated in Figure 5. At the beginning, it generates multiple balanced 3D parallel device grouping plans based on the input model and node specifications, temporarily disregarding their physical locations. GPU placement optimization follows within each plan, determining an efficient mapping of nodes and pipeline stages. The model is then partitioned across the pipeline stages for load balancing, and the best plan is selected through lightweight profiling and cost estimation.

III-B Effective Computing Power Maximization

Our goal is to assign heterogeneous GPUs for 3D parallelism by considering the balancing of computational load on different data parallel groups. A device group is defined as a collection of an arbitrary number of homogeneous or heterogeneous GPUs that collectively handle a complete model. To ensure consistent model accuracy and synchronous gradient aggregation, the computing power between device groups must be kept as balanced as possible without modifying the batch size. We model this as a nonlinear integer optimization problem to determine the optimal device grouping plan.

Node specification. AutoHet supports heterogeneous GPU clusters, allowing for variations in both the number and type of GPUs across different nodes. The heterogeneous GPU configuration $S$ is formulated as a set of 3-tuples, such as {(0, 8, A100), (1, 4, H800), …}, indicating the presence of eight A100 GPUs on node 0, four H800 GPUs on node 1, and so forth.

Problem formulation. The total number of GPUs is denoted by $N$. The variable $x_{i,j}\in\left\{0,1\right\}$ represents whether the $i^{th}$ GPU is assigned to the $j^{th}$ DP group. Similarly, $y_{j}\in\left\{0,1\right\}$ indicates the presence of at least one GPU in the $j^{th}$ DP group. Since each DP group must contain at least one GPU, the number of DP groups cannot exceed $N$. The memory capacity of the $i^{th}$ GPU is denoted by $m_{i}$. Because TP needs to be symmetric (as detailed in Section II-B), the modeling can be simplified to focus on the combination of DP and PP. We introduce a new metric called effective computing power, denoted by $G_{j}$. This metric quantifies the actual computing power of the $j^{th}$ DP group.

$\displaystyle G_{j}=\sum_{i=1}^{N}g_{i}\cdot x_{i,j}\cdot\left(1-\rho_{j}\right)$	(2)

where $g_{i}$ quantifies the computing power of the $i^{th}$ GPU, while $\rho_{j}$ denotes the pipeline bubble ratio in the $j^{th}$ DP group. The problem is formulated in Equations (3a)-(3e). The core idea is to maximize the product of the number of valid DP groups and the minimum effective computing power, thereby minimizing the per-iteration time.

In the objective function, we sum over all variables $y_{j}$ to determine the total number of valid DP groups. The variable $z=\min_{j}\left\{{G_{j}}\right\}$ is introduced to represent the minimum effective computing power across all valid DP groups.


$\displaystyle\underset{\left\{x_{i,j}\right\}}{\text{Maximize}}\qquad$	$\displaystyle\sum_{j=1}^{N}y_{j}\cdot z$	(3a)
Subject to:	$\displaystyle\sum_{i=1}^{N}m_{i}\cdot x_{i,j}+L\cdot(1-y_{j})\geq\text{MIN}_{\text{mem}},\ \forall j;$	(3b)
	$\displaystyle G_{j}\cdot y_{j}+L\cdot(1-y_{j})\geq z,\qquad\forall j;$	(3c)
	$\displaystyle\frac{1}{L}\cdot\sum_{i=1}^{N}x_{i,j}\leq y_{j}\leq\sum_{i=1}^{N}x_{i,j},\qquad\forall j;$	(3d)
	$\displaystyle\sum_{j=1}^{N}x_{i,j}=1,\qquad\forall i.$	(3e)

Here, Constraint (3b) ensures that each DP group is equipped with sufficient memory for training. We profile the minimum required memory $\text{MIN}_{\text{mem}}$ during model training. Constraint (3c) bounds the minimum effective computing power $z$ over all valid DP groups. $L$ is a sufficiently large constant, introduced as an auxiliary term to handle the special case in which a DP group is empty. Constraint (3d) characterizes the value of $y_{j}$, while Constraint (3e) ensures that each GPU is assigned to exactly one DP group. Owing to this reduced complexity, we compute the optimal solution directly with the mathematical solver SCIP [scip].
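For intuition, the grouping problem can be solved by exhaustive enumeration on a tiny cluster. The sketch below is not the SCIP formulation: it swaps the solver for brute force, and it assumes a bubble ratio of $\rho_{j}=(p_{j}-1)/(K+p_{j}-1)$, where $p_{j}$ is the number of GPUs in group $j$ and $K$ the number of micro-batches. That formula, the function names, and the example numbers are assumptions of this sketch rather than values taken from AutoHet.

```python
from itertools import product

def bubble_ratio(num_stages, num_microbatches):
    """Assumed pipeline bubble ratio for a pipeline with num_stages stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

def effective_power(group, power, num_microbatches):
    """Eq. (2): summed computing power of a group, discounted by its bubble ratio."""
    rho = bubble_ratio(len(group), num_microbatches)
    return sum(power[i] for i in group) * (1.0 - rho)

def best_grouping(power, memory, min_mem, num_microbatches):
    """Enumerate every GPU-to-group assignment and keep the best objective (3a)."""
    n = len(power)
    best_score, best_groups = 0.0, None
    for assignment in product(range(n), repeat=n):   # (3e): one group per GPU
        groups = {}
        for gpu, j in enumerate(assignment):
            groups.setdefault(j, []).append(gpu)
        # (3b): every non-empty group needs enough aggregate memory.
        if any(sum(memory[i] for i in g) < min_mem for g in groups.values()):
            continue
        z = min(effective_power(g, power, num_microbatches) for g in groups.values())
        score = len(groups) * z                       # (3a)
        if score > best_score:
            best_score, best_groups = score, sorted(groups.values())
    return best_score, best_groups

# Two H800-like GPUs (power 2) and two A100-like GPUs (power 1), 80 GB each;
# the model is assumed to need 120 GB, so a group needs at least two GPUs.
power, memory = [2, 2, 1, 1], [80, 80, 80, 80]
score, groups = best_grouping(power, memory, min_mem=120, num_microbatches=8)
```

Under these assumptions the search pairs each fast GPU with a slow one, giving two DP groups of equal effective computing power, rather than grouping the two fast GPUs together. Enumeration grows exponentially with the number of GPUs, which is why a dedicated solver is needed at realistic scale.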

III-C GPU Mapping and Model Partitioning

Once the effective computing power is maximized and balanced across all DP groups, AutoHet needs to materialize this allocation strategy, i.e., to map each model partition onto a GPU. To meet the communication and storage constraints, the following principles are developed.

GPU node mapping. The mapping of GPU nodes determines the allocation of communication bandwidth across different parallel dimensions. In this context, we consider two levels of communication bandwidth: high-speed intra-node communication via NVLink (e.g., 600 GB/s for A100 GPUs) and low-speed inter-node communication via RDMA (e.g., 400 Gb/s). Bandwidth allocation is prioritized according to communication volume, with TP operations receiving the highest priority, followed by DP, and then PP. Specifically, we ensure that all communications between TP operations are routed through NVLink, with any remaining NVLink bandwidth allocated to DP groups to maximize its utilization.

Pipeline stage mapping. We find that earlier stages require more memory to store forward activations, aligning with our observation of underutilized memory in lower-end GPUs (detailed in §II-D). Furthermore, since computation and communication overlap in pipeline parallelism, communication efficiency is mainly influenced by the earlier stages. Lower-end GPUs handle smaller workloads and generate less communication, so AutoHet assigns these GPUs to the earlier pipeline stages where memory demand is high but the computational load is low.

Building on the aforementioned principles, AutoHet develops a heuristic algorithm for mapping GPUs to specific physical nodes and pipeline stages. The key idea is to assign GPUs with lower communication overhead and higher available memory to the earlier pipeline stages, while allocating bandwidth sequentially based on communication priorities. Notice that AutoHet initiates the process by identifying valid TP dimensions (Line 2), which require the number of GPUs per node to be an integer multiple of the TP dimension. During subsequent processing steps, GPUs assigned to a TP group are treated as a single entity. The algorithm begins by sorting all GPU types in the $type\_set$ according to their computing power. It then iteratively selects the GPU type with the lowest computing power and checks two conditions: i) whether each DP group has an unassigned GPU of this type; and ii) whether there are available GPUs on the same node for all DP groups. If both conditions are met, the algorithm assigns the GPU type to the earliest unassigned pipeline stage and allocates the corresponding rank on the physical node. This process continues until NVLink communication requirements between DP groups can no longer be fulfilled. Any remaining pipeline stages and physical nodes are then assigned in sequence.
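The first step of this heuristic, enumerating valid TP dimensions, can be sketched directly from the stated rule that the number of GPUs on every node must be an integer multiple of the TP dimension. The function name `valid_tp_sizes` is hypothetical, and a real system may impose further restrictions (for example on attention-head divisibility) that are not modeled here.

```python
def valid_tp_sizes(node_spec):
    """Return TP degrees t such that every node's GPU count is a multiple of t.

    node_spec: iterable of (node_id, num_gpus, gpu_type) tuples, following the
    3-tuple format of the node specification.
    """
    counts = [num_gpus for _, num_gpus, _ in node_spec]
    return [t for t in range(1, min(counts) + 1) if all(c % t == 0 for c in counts)]

# Eight A100 GPUs on node 0 and four H800 GPUs on node 1.
node_spec = [(0, 8, "A100"), (1, 4, "H800")]
tp_sizes = valid_tp_sizes(node_spec)  # [1, 2, 4]
```

Because each TP group must fit inside one node, a TP degree of 8 is rejected here: node 1 only offers four GPUs.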

Load balancing in pipeline parallelism aims to distribute the workload efficiently across all stages to avoid bottlenecks and ensure optimal resource utilization. To address this, we introduce an optimization model for effective model partitioning.

Problem formulation. Let $P$ be the number of PP stages in a DP group, $p_{i}$ the PP stage in which the $i^{th}$ GPU is located, $l_{i}$ the number of model layers allocated to the $i^{th}$ GPU, and $N_{layers}$ the total number of model layers. The objective of the optimization problem is to allocate model layers across GPUs in a manner that best aligns with their respective computing power:


$\displaystyle\underset{\left\{1\leq i\leq P\right\}}{\text{Minimize}}\ \ \quad$	$\displaystyle\max_{i}\left\{\frac{g_{i}}{l_{i}}\right\}$	(4a)
Subject to:	$\displaystyle N_{layers}=\sum_{i=1}^{P}l_{i};$	(4b)
	$\displaystyle\text{MEM}_{\text{F}}(l_{i})+\text{MEM}_{\text{V}}(l_{i},p_{i})\leq m_{i},\quad\forall i.$	(4c)

Here, Equation (4b) is a natural constraint of storing all the layers of the large language model. Constraint (4c) restricts the allocation of layers to each stage, ensuring that the number of layers assigned does not exceed the stage’s memory capacity. We profile both the fixed memory components $\text{MEM}_{\text{F}}(l_{i})$ (e.g., model parameters, gradients, and optimizer states) and the variable memory components $\text{MEM}_{\text{V}}(l_{i},p_{i})$ (e.g., forward activations) during model training.
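A small brute-force sketch shows how the memory constraint (4c) can pull the solution away from a purely proportional split. The objective (4a) is evaluated literally as written. The memory model is a hypothetical linear stand-in for the profiled $\text{MEM}_{\text{F}}$ and $\text{MEM}_{\text{V}}$: a fixed cost per layer plus an activation cost per layer that grows for earlier stages. All names and numbers are illustrative assumptions.

```python
from itertools import product

def partition_layers(power, mem_cap, n_layers, mem_fixed, mem_act):
    """Enumerate layer splits for one pipeline and return the best feasible one.

    power[p] and mem_cap[p] describe the GPU at (0-based) stage p.
    """
    num_stages = len(power)
    best_obj, best_split = float("inf"), None
    for split in product(range(1, n_layers + 1), repeat=num_stages):
        if sum(split) != n_layers:                       # (4b)
            continue
        # (4c) with an assumed memory model: stage p keeps activations for
        # (num_stages - p) in-flight micro-batches, so earlier stages need more.
        fits = all(
            mem_fixed * l + mem_act * l * (num_stages - p) <= mem_cap[p]
            for p, l in enumerate(split)
        )
        if not fits:
            continue
        obj = max(g / l for g, l in zip(power, split))   # (4a)
        if obj < best_obj:
            best_obj, best_split = obj, split
    return best_split

# A slow GPU (power 1) at stage 0 and a fast GPU (power 2) at stage 1, 12 layers.
roomy = partition_layers([1, 2], [80, 80], 12, mem_fixed=6, mem_act=1)  # (4, 8)
tight = partition_layers([1, 2], [80, 50], 12, mem_fixed=6, mem_act=1)  # (5, 7)
```

With ample memory the split is proportional to computing power (4 and 8 layers). Once the fast GPU is capped at 50 memory units it can hold at most 7 layers, and the remaining layer shifts to the slower GPU.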

III-D Profiling Acceleration

To execute the 3D parallelism planning algorithm, AutoHet requires the computation time for each pipeline stage and peak memory to avoid out-of-memory (OOM) errors. We propose profiling acceleration strategies to expedite training initiation.

Runtime profiling. The computation time for each pipeline stage is influenced by factors such as GPU types, workload distribution, and TP dimensions, leading to substantial profiling overhead. However, for repetitive architectures (e.g., GPT-2, LLaMA), we observe that the runtime of multiple model layers can be approximated by the cumulative runtime of individual layers with negligible error. Therefore, we adopt a binary decomposition approach, profiling iteration times for layer counts that are powers of two (e.g., 1, 2, 4, 8). Arbitrary layer counts are then represented as sums of these pre-profiled powers of two, as expressed in the following equation:

$\displaystyle T_{gpu}^{tp}(n)=\sum_{i=0}^{k}\alpha_{i}\cdot T_{gpu}^{tp}(2^{i})$	(5)

where $T_{gpu}^{tp}(n)$ is the estimated iteration time for $n$ layers, $\alpha_{i}$ is a coefficient indicating the presence of the layer block in the decomposition of $n$ , and $k=\lfloor\log_{2}{n}\rfloor$ .
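Equation (5) amounts to reading off the binary representation of $n$: the coefficient $\alpha_{i}$ is the $i$-th bit. A minimal sketch follows, in which the function name and the profiled timings are hypothetical placeholders rather than measured values.

```python
def estimate_time(n, profiled):
    """Estimate the iteration time for n layers from power-of-two profiles.

    profiled maps 2**i to the measured time for 2**i layers on a given GPU
    type and TP dimension; bit i of n plays the role of alpha_i in Eq. (5).
    """
    total, i = 0.0, 0
    while n:
        if n & 1:
            total += profiled[1 << i]
        n >>= 1
        i += 1
    return total

# Hypothetical profiled times (in ms) for 1, 2, 4 and 8 layers.
profiled = {1: 1.0, 2: 1.9, 4: 3.7, 8: 7.2}
t13 = estimate_time(13, profiled)  # 13 = 8 + 4 + 1, so 7.2 + 3.7 + 1.0
```

Only $\lfloor\log_{2}{n}\rfloor+1$ profiling runs are needed per GPU type and TP dimension, instead of one run per possible layer count.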

Memory profiling. To reduce overhead, we introduce pruning strategies based on two observations: i), in repetitive model structures, memory consumption is mainly determined by the number of layers, with minimal dependence on their starting or ending points; ii), peak memory usage scales predictably with layer count. Consequently, we profile memory usage for a single model layer across different TP dimensions and derive the memory usage for multiple layers by summing the memory usage of individual layers.

Input: Node specification $S$; Model config $\xi$;
Output: Optimal 3D parallel plan $P^{\ast}$
1 /* Initialize the valid TP dimensions */
2 $(t_{1},t_{2},\ldots,t_{n})\leftarrow getValidTpSize(U)$
3 $M\leftarrow estimateMemory(\xi)$; $X\leftarrow\left[\ \right]$; $T^{\ast}\leftarrow\infty$
4 for $tp\_dim\in(t_{1},t_{2},\ldots,t_{n})$ do
5     /* Modeling device grouping */
6     status, plan $\leftarrow groupingDevice(U,M,\xi)$
7     if status then
8         Plans $\leftarrow append(plan)$
9 for $plan\in Plans$ do
10     stage, node $\leftarrow mapNodeAndStage(U,plan)$
11     /* Modeling stages load balancing */
12     $P\leftarrow balanceWorkload(stage,node,plan,\xi)$
13     if $Cost(P)<T^{\ast}$ then
14         $T^{\ast}=Cost(P)$
15         $P^{\ast}\leftarrow P$
16 return $P^{\ast}$

Algorithm 1: 3D Parallel Planning Algorithm

To summarize, Algorithm 1 presents AutoHet’s 3D parallel planning algorithm for identifying better parallelism plans. It consists of three parts: 1) grouping devices to achieve balanced computing power (Lines 3-8); 2) mapping GPUs to nodes and stages to optimize communication efficiency (Line 10); and 3) balancing the workload across PP stages (Line 12). After generating several promising plans, AutoHet estimates their iteration times and selects the optimal parallelism plan (Lines 14-16).

IV Elastic Training Recovery

IV-A Challenges of Training Recovery

The availability of spot instances varies over time, depending on the demand of high-priority tasks. When training with GPU spot instances, a GPU currently in use can be preempted at the next billing slot, or a new set of GPUs can become available; either event triggers an update of the parallelization plan. The reconfiguration overhead needs to be minimal in order to maintain continuity and efficiency in the parallel training process. Existing training frameworks such as DeepSpeed and Megatron-LM can use checkpointing strategies to handle spot instance preemption. Though feasible, these strategies are inefficient since the checkpoint is stored as a file at the GPU granularity and transmitted to the cloud storage server. Consider a Llama-2 13B model: the checkpoint contains the full-precision optimizer state and the half-precision model weights, totaling 180GB in size. Thus, uploading the checkpoint to the cloud server and downloading it to all the participating GPUs are throttled by the communication bottleneck between the storage server and the training nodes.
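The checkpoint size quoted above can be approximated with simple arithmetic. The sketch below assumes that the full-precision optimizer state holds three 4-byte values per parameter (a master copy of the weight, momentum, and variance) and that the half-precision weight takes 2 bytes. This breakdown, the function names, and the example bandwidth are assumptions of this sketch rather than figures from the paper.

```python
def checkpoint_bytes(num_params, weight_bytes=2, optimizer_bytes=12):
    """Approximate checkpoint size under the assumed per-parameter byte counts."""
    return num_params * (weight_bytes + optimizer_bytes)

def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to move a checkpoint over a link of the given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s

size = checkpoint_bytes(13_000_000_000)   # 182_000_000_000 bytes, about 182 GB
# Hypothetical 1 GB/s link to the storage server: roughly three minutes one way.
upload_time = transfer_seconds(size, 1_000_000_000)
```

Under these assumptions the estimate lands close to the 180GB stated in the text, and the transfer time scales inversely with the bandwidth of the link to the storage server.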

The communication bandwidth between training nodes is typically an order of magnitude higher than that between a training node and the storage server. Therefore, a natural strategy is to retrieve training states from local nodes whenever possible and only fetch the missing pieces from the cloud storage server. This approach resembles failure-induced checkpoint recovery [gimini, gandhi2024recycle], but with key differences. In the prior literature, the entire checkpoint file of a GPU is replicated to a recovery node, whereas AutoHet adopts a more fine-grained strategy: it generates checkpoint files at the layer level, tracks the physical locations of model partitions after each update to the 3D parallelization plan, and re-partitions the checkpoint to align with the new parallelization configuration.
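A minimal sketch of the local-first retrieval idea follows, assuming layer-level checkpoint files and a table that records which node currently holds which layers. The function name `plan_recovery`, the "peer" option for fetching from another surviving training node, and the data layout are assumptions made for illustration, not AutoHet's actual interface.

```python
def plan_recovery(new_plan, local_files):
    """Decide where each rank obtains every layer checkpoint it needs.

    new_plan:    {rank: [layer ids the rank must hold under the new plan]}
    local_files: {rank: set of layer ids whose checkpoints sit on that rank's node}
    Returns {rank: {layer: "local" | "peer:<rank>" | "cloud"}}.
    """
    sources = {}
    for rank, layers in new_plan.items():
        sources[rank] = {}
        for layer in layers:
            if layer in local_files.get(rank, set()):
                sources[rank][layer] = "local"
                continue
            # Prefer another surviving training node over the storage server.
            holders = sorted(r for r, files in local_files.items() if layer in files)
            sources[rank][layer] = f"peer:{holders[0]}" if holders else "cloud"
    return sources

# Six layers were spread over ranks 0-2; rank 2 was preempted, so only the
# files on ranks 0 and 1 survive locally.
local_files = {0: {0, 1}, 1: {2, 3}}
new_plan = {0: [0, 1, 2], 1: [3, 4, 5]}
sources = plan_recovery(new_plan, local_files)
```

In this example only layers 4 and 5, which lived on the preempted GPU, have to be downloaded from cloud storage; the other four layers are served from the training nodes.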

IV-B Design of Training Recovery

IV-B1 Layer-wise Checkpoint Generation

To handle dynamically changing parallelization plans, AutoHet adopts a layer-wise checkpoint generation method and a hierarchical checkpoint storage scheme. In PyTorch, state_dict is a dictionary object used to save all crucial information during training, including model parameter values as well as optimizer states such as momentum and variance. In general, state_dict is saved directly with the torch.save function and loaded during recovery. In distributed training with heterogeneous GPU types, however, it no longer has a symmetric, standard shape. AutoHet therefore adopts a layer-wise checkpoint generation procedure, since a layer is the minimum unit of LLMs under different parallelization plans.

In AutoHet, we filter the parameters of each layer by traversing all the key-value pairs of state_dict. Once each layer’s parameters are identified, they are extracted from the original state_dict and stored into a new dictionary object called layer_dict. layer_dict is organized by layers and contains only the parameters related to specific model layers, allowing for precise loading during recovery. Similarly, optimizer states are also stored at the layer granularity. AutoHet creates an optimizer_dict that records the optimizer state for each layer including its momentum and variance, and stores them as independent dictionary entries. After filtering and reorganizing both model parameters and optimizer states, AutoHet periodically saves the layer_dict and optimizer_dict to CPU memory and disk using PyTorch’s torch.save function, and replicates them to the cloud storage server. Note that checkpoints should not be stored solely in CPU memory due to its volatile nature. In Kubernetes-managed environments, memory is cleared when processes are preempted or containers are rescheduled. Host machine SSDs provide persistent storage, ensuring data continuity and enabling recovery in such scenarios.
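The filtering step can be sketched as follows. This is a minimal illustration rather than AutoHet's actual code: the key pattern `layers.<index>.` and the helper name `split_by_layer` are assumptions, and plain strings stand in for tensors so that the example runs without PyTorch.

```python
import re
from collections import defaultdict
from typing import Any, Dict

# Assumed key pattern "...layers.<index>.<param name>" (hypothetical naming).
LAYER_KEY = re.compile(r"layers\.(\d+)\.")


def split_by_layer(state_dict: Dict[str, Any]) -> Dict[int, Dict[str, Any]]:
    """Group state_dict entries by the layer index embedded in each key."""
    layer_dict: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for key, value in state_dict.items():
        match = LAYER_KEY.search(key)
        # Entries without a layer index (e.g. embeddings) are grouped under -1.
        layer_id = int(match.group(1)) if match else -1
        layer_dict[layer_id][key] = value
    return dict(layer_dict)


# Toy state_dict; the string values are placeholders for parameter tensors.
state_dict = {
    "embedding.weight": "E",
    "layers.0.attn.weight": "W0",
    "layers.0.mlp.weight": "M0",
    "layers.1.attn.weight": "W1",
}
layer_dict = split_by_layer(state_dict)
for layer_id in sorted(layer_dict):
    print(layer_id, sorted(layer_dict[layer_id]))
```

Each per-layer entry could then be written to its own file with torch.save. The same grouping would apply to optimizer states once every entry is mapped back to its parameter name; that mapping is omitted here.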

IV-B2 Adaptive Checkpoint Loading

After the parallelization plan changes, the training states need to be redistributed on the new set of GPUs. We design a parameter-level checkpoint loading scheme to flexibly adapt to this change. Specifically, it handles the following three scenarios:

i) Unaltered TP dimension. As shown in Figure 6(a), we suppose all GPUs are of the same model and the tensor parallelism (TP) dimension remains fixed at 1 for simplicity. Before preemption, the GPUs of ranks 0–2 are pipeline-parallel and evenly split six model layers, with each GPU handling two layers of parameters. When rank 2 is preempted, rank 0 and rank 1 take over three layers each. Since the partitioning of parameters within each layer has not changed, each current rank only needs to read the checkpoint files corresponding to its model layer numbers and TP rank ID. For example, rank 0 will read checkpoints $0\_0$, $1\_0$, and $2\_0$ (the first digit for layer ID and the second digit for TP partition ID), corresponding to TP rank 0 parameters for layers 0 to 2.

ii) Increased TP dimension. In this case, directly loading layer-level checkpoints is not feasible, as their formats are not aligned and they must be re-partitioned according to the new TP dimension. In Figure 6(b), after the preemption of ranks 4 and 5, the TP dimension increases from 2 to 4. Taking the loading of layer 0 as an example, the full parameters of layer 0 were initially stored in checkpoints $0\_0$ and $0\_1$. After the parallelization plan is updated, the TP ranks expand to GPUs 0, 1, 2, and 3. Ranks 0 and 1 need to read checkpoint $0\_0$, while ranks 2 and 3 read checkpoint $0\_1$. AutoHet performs a split operation on each parameter matrix along the corresponding dimension in the checkpoint files, slices them into smaller blocks, and reorganizes them into a complete state_dict for each rank to load its required parameters.

iii) Decreased TP dimension. As shown in Figure 6(c), when ranks 3, 4, and 5 are preempted, the TP dimension decreases from 2 to 1. Again taking the loading of layer 0 as an example, after preemption the TP group shrinks to GPU 0 alone. Here, TP rank 0 needs to read both checkpoints $0\_0$ and $0\_1$. AutoHet performs a concatenation operation on each parameter matrix from the two checkpoint files along the corresponding dimension, merges them into complete parameter blocks, and reorganizes them into a complete state_dict.
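The three scenarios can be summarized by a small mapping from a new TP rank to the old checkpoint shards it must read. The sketch below is illustrative only: the function name `shards_to_read` and its return format are hypothetical, and it assumes that one TP degree evenly divides the other. It reuses the "layer ID, TP partition ID" file naming from the examples above.

```python
from typing import List, Tuple


def shards_to_read(layer: int, old_tp: int, new_tp: int,
                   new_rank: int) -> List[Tuple[str, int, int]]:
    """Return (checkpoint_name, piece_index, num_pieces) for one new TP rank.

    piece_index / num_pieces say which slice of the old shard to keep
    after splitting it into num_pieces equal blocks.
    """
    if new_tp >= old_tp:
        if new_tp % old_tp:
            raise ValueError("new_tp must be a multiple of old_tp")
        # Unchanged or increased TP: read one old shard and keep one slice.
        pieces = new_tp // old_tp
        return [(f"{layer}_{new_rank // pieces}", new_rank % pieces, pieces)]
    if old_tp % new_tp:
        raise ValueError("old_tp must be a multiple of new_tp")
    # Decreased TP: read several consecutive old shards in full and
    # concatenate them in order.
    merged = old_tp // new_tp
    first = new_rank * merged
    return [(f"{layer}_{first + i}", 0, 1) for i in range(merged)]


# Scenario ii: TP grows from 2 to 4 for layer 0.
for rank in range(4):
    print(rank, shards_to_read(0, old_tp=2, new_tp=4, new_rank=rank))
# Scenario iii: TP shrinks from 2 to 1 for layer 0.
print(shards_to_read(0, old_tp=2, new_tp=1, new_rank=0))
```

With PyTorch tensors, the slicing and merging would typically be done with torch.chunk and torch.cat. The dimension to split along depends on how each layer is partitioned, which this sketch does not model.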

IV-C Accelerated Recovery

AutoHet designs an accelerated recovery strategy by leveraging a layer bitmap to record the physical locations of layer-wise checkpoints. Notably, a GPU spot instance may be reclaimed before the latest checkpoints are written to local storage, making them unavailable on the training nodes and only accessible from the cloud storage server. Hence, we consider two checkpoint transmission scenarios as follows.

The first scenario occurs when, after resource changes, the checkpoints stored locally cannot be combined to form the complete model parameters and optimizer states. According to the layer bitmap, AutoHet identifies which parameter layers each GPU needs and determines whether those parameters are already stored locally on that GPU. AutoHet prioritizes loading the checkpoints stored on the local disk or in CPU memory, and only fetches the missing pieces from the remote cloud.

The second scenario is that the complete model parameters and optimizer states can be retrieved from the local training nodes. AutoHet utilizes the RDMA links between training nodes to redistribute the training states, instead of downloading the checkpoints from the cloud storage server.
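The source selection implied by the two scenarios can be sketched with an integer bitmap per node. This is an assumed encoding, not the one used by AutoHet: bit i of `bitmap[node]` is set when that node holds the checkpoint of layer i, and the function name `plan_sources` and the node labels are hypothetical.

```python
from typing import Dict, List


def plan_sources(needed: List[int], my_node: str,
                 bitmap: Dict[str, int]) -> Dict[int, str]:
    """Choose a source for each needed layer.

    Preference order: the local node, then a peer training node
    (e.g. over RDMA), then the cloud storage server.
    """
    plan: Dict[int, str] = {}
    for layer in needed:
        mask = 1 << layer
        if bitmap.get(my_node, 0) & mask:
            plan[layer] = "local"
            continue
        # Sort node names so that the choice of peer is deterministic.
        peers = [n for n in sorted(bitmap)
                 if n != my_node and bitmap[n] & mask]
        plan[layer] = f"peer:{peers[0]}" if peers else "cloud"
    return plan


# N0 holds layers 0-2, N1 holds layers 3-4, and layer 5 is on no node.
bitmap = {"N0": 0b000111, "N1": 0b011000}
plan = plan_sources([0, 3, 5], "N0", bitmap)
print(plan)  # {0: 'local', 3: 'peer:N1', 5: 'cloud'}
```

In this toy case only layer 5 has to be downloaded from the cloud, since it is the only layer that no training node holds.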

V Evaluation

We implement AutoHet in Python with 4000+ LOC (3165 for 3D parallelism and 1204 for elastic recovery), and our system modules are integrated into Megatron-LM. Due to limited space, we do not elaborate on the detailed implementation, but will open-source it later on.

Experimental setup. We conduct our experiments using three types of NVIDIA GPUs: (i) A100 with 80GB HBM, (ii) H800 with 80GB HBM, and (iii) H20 with 100GB HBM. Our platform consists of four nodes, each equipped with eight GPUs. Specifically, Node 0 and Node 3 are A100 nodes, Node 1 is an H800 node, and Node 2 is an H20 node. Intra-node GPU communication is facilitated by NVLink, while inter-node communication uses 400Gbps RoCEv2. We evaluate three widely adopted LLMs including BERT-Large, GPT-3, and LLaMA, all of which utilize the Transformer architecture. Note that Node 3 is used exclusively in the evaluation of elastic recovery strategies.

V-A End-to-End Parallelization Performance

We first compare the end-to-end training performance (tokens/s) of AutoHet against two SOTA training systems: Megatron-LM and Whale. Since both systems lack automatic parallelism plans, we report their best-performing results from the parallel structures supported by each system. To evaluate AutoHet’s effectiveness, we consider two settings: a uniform GPU distribution, where each node is allocated an equal number of GPUs, and a non-uniform distribution, which reflects real-world heterogeneous GPU environments where uniform GPU provisioning is often impractical.

V-A1 Uniform GPU Distribution

We evaluate two GPU type combinations: H800+A100 and A100+H20, with each node configured with 2, 4, or 8 GPUs. We select two model architectures: BERT-Large with 340M parameters and GPT-3 with 6.7B parameters. The training performance is displayed in Figure 7.

BERT-Large results. AutoHet achieves an average training throughput 1.38 $\times$ higher than Megatron-LM across all experiments. Due to its relatively small size, BERT-Large fits on a single GPU of any type. Hence, Megatron-LM directly adopts full data parallelism despite GPU heterogeneity, which causes a severe straggler problem. Whale readjusts the batch size on each GPU according to its computing power (referred to as “Intra-TaskGraph load balance”), enabling it to achieve effective performance under these conditions. AutoHet instead employs a hybrid approach that combines PP and DP, while implementing load balancing across different pipeline stages. Nevertheless, AutoHet automatically generates the parallelism plan and attains comparable performance without adversely affecting convergence.

GPT-3 results. Across all experiments, AutoHet outperforms Megatron-LM and Whale in training throughput by 1.53 $\times$ and 1.27 $\times$ on average, respectively. The performance improvement can be attributed to two major factors: i) unlike BERT-Large, GPT-3 cannot be stored on a single GPU due to its relatively large model size, necessitating model parallelism. The workload balancing method in AutoHet addresses this by efficiently distributing model layers across heterogeneous GPUs. In contrast, Megatron-LM assumes homogeneous GPUs, resulting in a uniform layer division that fails to exploit the performance variations across different GPU types; ii) AutoHet schedules GPUs with lower computing power to earlier pipeline stages, enabling GPUs with higher computational power to handle larger portions of the workload. This is in contrast to Megatron-LM and Whale, which allocate stages based on a sequential GPU node order without considering the specific performance characteristics of each node.
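To make the contrast with uniform layer division concrete, the sketch below splits layers across pipeline stages in proportion to each stage's relative speed. This is a deliberately simplified stand-in and not AutoHet's planner, which solves an optimization problem that also accounts for memory. The function name and the relative speeds used here are hypothetical values chosen for illustration, not measurements.

```python
from typing import List


def proportional_layers(num_layers: int, speeds: List[float]) -> List[int]:
    """Split layers across pipeline stages in proportion to stage speed.

    Uses largest-remainder rounding so the allocation sums to num_layers.
    Memory limits and stage ordering are ignored in this sketch.
    """
    total = sum(speeds)
    ideal = [num_layers * s / total for s in speeds]
    alloc = [int(x) for x in ideal]
    # Give the leftover layers to the stages with the largest remainders.
    leftover = num_layers - sum(alloc)
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Uniform division, as a homogeneous-GPU assumption would produce.
print(proportional_layers(32, [1.0, 1.0]))  # [16, 16]
# Hypothetical case: the second stage is assumed to be twice as fast.
print(proportional_layers(32, [1.0, 2.0]))  # [11, 21]
```

With equal speeds the split degenerates to the uniform division, while unequal speeds shift layers toward the faster stage so that per-stage compute time is closer to balanced.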

V-A2 Non-uniform GPU Distribution

We select the LLaMA model with 6.7B parameters and evaluate two distinct GPU type combinations: H800+A100 and A100+H20. Figure 8 illustrates the training performance across various combinations of GPU quantities.

For the H800+A100 combination, AutoHet significantly outperforms Megatron-LM and Whale by average factors of 1.79 $\times$ and 1.51 $\times$ in training throughput, respectively. The main reason for this improvement is AutoHet’s support for asymmetric parallel structures, whereas the baseline systems are constrained by symmetric parallelism. For instance, in the 4 $\times$ A100+2 $\times$ H800 experimental setting, AutoHet can configure a TP degree of two, with H800 models forming individual DP groups and A100 models constituting another DP group with a two-stage pipeline. In contrast, Megatron-LM and Whale are limited to forming costly long pipeline parallel structures. Similarly, in configurations such as 5 $\times$ A100+3 $\times$ H800 and 3 $\times$ A100+5 $\times$ H800, when the number of GPUs per node is odd (preventing the formation of TP groups), AutoHet can create a more balanced combination of pipeline and data parallelism, while Megatron-LM and Whale are unable to achieve load balancing across model layers due to their inability to support an inconsistent number of layers within the same stage across different DP groups.

For the A100+H20 combination, we design experimental settings with larger quantity disparities between GPU types. Across all configurations, Megatron-LM and Whale are constrained to adopt pure pipeline parallelism approaches, as they cannot support structures with inconsistent GPU counts across different data parallel groups. In contrast, AutoHet demonstrates remarkable flexibility in its parallel structures. For example, in the 1 $\times$ A100+4 $\times$ H20 experimental setting, AutoHet can form two DP groups: one comprising 1 $\times$ A100+1 $\times$ H20, and another consisting of 3 $\times$ H20, with pipeline parallelism employed within each DP group. This adaptive parallelization strategy enables AutoHet to achieve average speedups of 1.44 $\times$ and 1.16 $\times$ over Megatron-LM and Whale respectively.

V-B Breakdown Analysis and System Overheads

We conduct the performance breakdown experiments on two heterogeneous GPU configurations: 4 $\times$ A100+4 $\times$ H800 and 8 $\times$ A100+8 $\times$ H800, as shown in Figure 9. Since both experimental settings demonstrate similar improvement trends, we analyze the results from the 4 $\times$ A100 + 4 $\times$ H800 setup to evaluate the contribution of each component of AutoHet. Using GPT-3 6.7B as a case study and basic pipeline parallelism training as the baseline, we measure the cumulative benefit of incrementally adding each module compared to the baseline. We observe that the device grouping module achieves a 1.11 $\times$ increase in throughput, and this gain stems from the reduced bubble ratio in pipeline parallelism. The node and stage mapping module further improves the throughput gain to 1.16 $\times$ higher than the baseline. The workload balancing between stages raises the cumulative throughput gain to 1.79 $\times$, as the transformer layers are appropriately allocated to the GPUs of heterogeneous computing powers along the pipelines.

The profiling and planning overheads are crucial to a fair evaluation of AutoHet, especially in the context of spot instance training. We directly use SCIP [scip] to solve the non-linear integer programming problem in AutoHet. When the number of GPUs is in the set $\{16,24,32,64\}$, the planning times for searching the optimal parallelization strategies are $\{1.23,5.72,16.96,159.12\}$ seconds, respectively, which are acceptable for practical spot instance training scenarios. In contrast, Alpa [zheng2022alpa] demands 240 minutes to search for the intra- and inter-operator parallelization strategy with homogeneous GPUs, as reported in [um2024metis]. This improvement stems from the fact that a huge number of infeasible schemes have been eliminated in the mathematical framework of AutoHet, based on our observations and domain-specific simplifications. We further measure the realistic runtime of each layer and emulate the profiling time overheads. With the same set of GPUs, the profiling time of AutoHet increases from 11.9 to 15.4 minutes, and this overhead does not greatly scale up with the number of GPUs or the model size. Compared with Alpa, which requires a profiling time of 209 minutes [um2024metis], AutoHet is more than ten times faster.

V-C Elastic Recovery Time

Recovery time is the metric that measures the time required for training to continue after preemption and recovery. We hereby consider different resource configurations, parallel strategies and model scales to comprehensively evaluate AutoHet’s recovery efficiency. Our baseline strategy, Varuna [varuna], is a low-cost elastic training system designed for spot instances. It supports a hybrid data and pipeline parallel training paradigm, and enables hierarchical checkpoint storage and loading. Since Varuna does not support checkpoint recovery under tensor parallelism, we only compare its checkpoint fetching strategy with that of AutoHet. The experiments were conducted on GPT-3 models of sizes 3B, 6.7B, 13B, and 20B, with the cloud bandwidth set to 1200MB/s [alibaba_extreme_nas_2025] and local storage utilizing NVMe SSDs that achieve 3500MB/s end-to-end checkpoint loading bandwidth.

Figure 10 shows the recovery times of Varuna and AutoHet under three different scenarios. In scenario A, node $N_{0}$ has 8 A100 GPUs and node $N_{1}$ has 8 H20 GPUs. The current 3D parallelization strategy consists of four DP groups, with each group using 2 A100 GPUs and 2 H20 GPUs for PP and TP. When two DP groups (4 A100 and 4 H20 GPUs) are completely preempted, the local nodes maintain complete checkpoint replicas, enabling direct local access to all required training states. The training of Varuna pauses and downloads the checkpoint from the cloud storage for recovery, while AutoHet simply loads the checkpoint locally, significantly reducing the recovery time and achieving a 4.38 $\times$ speedup. In scenario B, the eight GPUs of node 0 are preempted so that the original parallelization strategy is changed to two DP groups, each containing 4 H20 GPUs operated in TP. Only part of the checkpoint is available locally according to the constructed layer bitmap, and thus the missing part is retrieved from the cloud, achieving a 1.49 $\times$ speedup compared with Varuna. Scenario C emulates an increase in available spot GPU instances. We add two nodes, $N_{2}$ with 2 A100 GPUs and $N_{3}$ with 2 H20 GPUs. The new parallelization strategy contains one more DP group, and the training state can be obtained entirely from local machines through RDMA links. In contrast, the cloud-based retrieval strategy becomes increasingly inefficient as the number of DP groups scales up, requiring the download of larger volumes of complete model parameters. This scalability limitation further demonstrates the superiority of AutoHet’s accelerated recovery approach, which is 3.59 $\times$ faster than Varuna.

VI Conclusion

We present AutoHet, an automated parallelization framework designed for distributed training on heterogeneous spot instance GPU clusters. By addressing the challenges of implementing 3D parallelism in diverse GPU environments, AutoHet supports asymmetric parallel structures and employs a novel 3D parallel planning algorithm to optimize workload distribution and minimize overheads. Our comprehensive experiments across three distinct GPU types and three different model architectures demonstrate that AutoHet achieves up to 1.79× throughput improvement over existing systems, and delivers 1.49× to 4.38× speedups in elastic recovery time.