SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication

arXiv cs.LG Papers

Summary

SCAPE is a communication-efficient distributed optimizer that leverages first-moment statistics to enable extreme sparsification for LLM training, preserving accuracy while reducing wall-clock time by up to 43.3%.

arXiv:2607.01678v1 Announce Type: new Abstract: Communication increasingly dominates the cost of Large Language Model (LLM) pre-training, especially under data-parallel and sharded training schemes, where gradient synchronization and parameter reconstruction overhead increase with model size and system scale. Existing communication-reduction methods either sparsify raw gradients, which can be unstable for modern Adam-style optimizers at high sparsity, or quantize communication, whose savings are fundamentally bounded by bit width and often incur additional runtime overhead. We present SCAPE, a communication-efficient distributed optimizer for LLM training that exploits the stability of AdamS's first-moment to enable aggressive sparsification without loss of LLM quality. Instead of constructing masks from raw gradients, SCAPE derives them from first-moment-based statistics, partitions mask generation across workers to align with optimizer sharding, and delays mask usage by one step so that mask synchronization can overlap with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, avoiding an additional collective. We implement SCAPE in Megatron-LM and evaluate its convergence by pre-training GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B using 32 NVIDIA GH200 GPUs on TACC Vista. In both models, SCAPE preserves training stability, validation loss, and downstream task accuracy under 90\% and 99\% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3\% while maintaining model quality comparable to dense AdamW and AdamS. For Llama-1.8B, SCAPE achieves up to 3.26$\times$ speedup per step compared to dense AdamS.


# SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication
Source: [https://arxiv.org/html/2607.01678](https://arxiv.org/html/2607.01678)
Mingkai Zheng, Junlin Chen, Haotian Xie, and Zhao Zhang Rutgers University \{mz687, junlin\.chen110, haotian\.xie, zhao\.zhang\}@rutgers\.edu

###### Abstract

Communication increasingly dominates the cost of Large Language Model (LLM) pre-training, especially under data-parallel and sharded training schemes, where gradient synchronization and parameter reconstruction overhead increase with model size and system scale. Existing communication-reduction methods either sparsify raw gradients, which can be unstable for modern Adam-style optimizers at high sparsity, or quantize communication, whose savings are fundamentally bounded by bit width and often incur additional runtime overhead. We present SCAPE, a communication-efficient distributed optimizer for LLM training that exploits the stability of AdamS’s first-moment to enable aggressive sparsification without loss of LLM quality. Instead of constructing masks from raw gradients, SCAPE derives them from first-moment-based statistics, partitions mask generation across workers to align with optimizer sharding, and delays mask usage by one step so that mask synchronization can overlap with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, avoiding an additional collective. We implement SCAPE in Megatron-LM and evaluate its convergence by pre-training GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B using 32 NVIDIA GH200 GPUs on TACC Vista. In both models, SCAPE preserves training stability, validation loss, and downstream task accuracy under 90% and 99% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3% while maintaining model quality comparable to dense AdamW and AdamS. For Llama-1.8B, SCAPE achieves up to 3.26× speedup per step compared to dense AdamS.

## I Introduction

Large Language Models (LLMs) with ever-increasing sizes have achieved unparalleled performance across many fields, including math reasoning \[[1](https://arxiv.org/html/2607.01678#bib.bib44)\], code generation \[[4](https://arxiv.org/html/2607.01678#bib.bib45)\], and autonomous laboratories \[[30](https://arxiv.org/html/2607.01678#bib.bib43)\]. State-of-the-art LLM training requires a massive number of graphics processing units (GPUs) to achieve faster training and accommodate enormous model parameters and optimizer states. The legacy data parallel (DP) strategy replicates the model and optimizer states (e.g., the first- and second-moment in AdamW \[[13](https://arxiv.org/html/2607.01678#bib.bib2)\]) across GPUs and distributes a mini-batch of data among them. Modern sharded data parallel strategies, such as ZeRO \[[22](https://arxiv.org/html/2607.01678#bib.bib11)\], FSDP \[[40](https://arxiv.org/html/2607.01678#bib.bib12)\], and Megatron-LM \[[17](https://arxiv.org/html/2607.01678#bib.bib10)\], reduce spatial redundancy by partitioning the model and optimizer states across GPUs. In either case, communication (i.e., all-reduce and all-gather) is the scaling bottleneck, because its cost grows with both the model size and the system scale, as shown in [Figure 1](https://arxiv.org/html/2607.01678#S1.F1).

Lowering communication volume is an effective way to reduce communication overhead, and researchers have explored several approaches. DGC \[[12](https://arxiv.org/html/2607.01678#bib.bib14)\], DeMo \[[19](https://arxiv.org/html/2607.01678#bib.bib19)\], EDGC \[[36](https://arxiv.org/html/2607.01678#bib.bib5)\], oktopk \[[11](https://arxiv.org/html/2607.01678#bib.bib46)\], and Radius \[[41](https://arxiv.org/html/2607.01678#bib.bib4)\] communicate top-$k$ gradients with error feedback to preserve model performance. A second line of research exploits quantization (low-bit representation) techniques. QSDP \[[14](https://arxiv.org/html/2607.01678#bib.bib42)\], ZeRO++ \[[34](https://arxiv.org/html/2607.01678#bib.bib41)\], and SDP4bit \[[9](https://arxiv.org/html/2607.01678#bib.bib40)\] reduce the number of bits for model parameters and gradients in the sharded data parallel strategy.

![Refer to caption](https://arxiv.org/html/2607.01678v1/x1.png) Figure 1: Scaling bottleneck for pre-training Llama-500M (sequence length 4K) and Llama-1.8B (sequence length 2K) using Megatron-LM with distributed optimizer on NVIDIA GH200 GPUs. Llama-1.8B uses a sequence length of 2K due to the limited memory of GH200.

Existing top-$k$ methods are limited in applicability to modern optimizers or require low sparsity due to constraints on model performance. Legacy top-$k$ methods, such as DGC \[[12](https://arxiv.org/html/2607.01678#bib.bib14)\] and DeMo \[[19](https://arxiv.org/html/2607.01678#bib.bib19)\], are designed to sparsify the momentum in the SGD \[[29](https://arxiv.org/html/2607.01678#bib.bib16)\] optimizer, which is rarely used in today’s LLM training. Radius \[[41](https://arxiv.org/html/2607.01678#bib.bib4)\] proposes an error-correction mechanism for the AdamW \[[13](https://arxiv.org/html/2607.01678#bib.bib2)\] optimizer, and can achieve 40% sparsity without degrading downstream task performance. The communication volume reduction capability of quantization methods, including QSDP \[[14](https://arxiv.org/html/2607.01678#bib.bib42)\], ZeRO++ \[[34](https://arxiv.org/html/2607.01678#bib.bib41)\], and SDP4bit \[[9](https://arxiv.org/html/2607.01678#bib.bib40)\], is inherently limited by the number of bits required to represent each gradient. Furthermore, without NCCL backend support, these quantization-based methods rely on expensive all-to-all communication, imposing high memory and communication overhead.

In this work, we ask the following research question: Can we reduce communication volume to the extreme without harming model performance? To this end, we propose SCAPE, a communication-efficient distributed optimizer that sparsifies the optimizer states rather than the gradients. SCAPE is built upon AdamS \[[39](https://arxiv.org/html/2607.01678#bib.bib7)\] and is motivated by two important insights:

- Compared to AdamW, AdamS exhibits substantially smaller residual growth under high sparsity in LLM pretraining, indicating improved robustness to stale error-feedback dynamics.
- The top-$k$ structure of the first-moment is temporally stable across two adjacent steps, suggesting that delayed mask reuse is practical.

Based on these observations, SCAPE introduces a partitioned mask-refresh mechanism aligned with optimizer sharding. SCAPE computes masks from momentum-based statistics rather than raw gradients, and delays their use by one step so that mask synchronization can overlap with computation. In addition, SCAPE reconstructs the quantities needed for second-moment updates using a single synchronized sparse buffer, avoiding a second all-reduce. When sharded data parallel is enabled, with the model and optimizer states distributed across all GPUs, SCAPE uses sparsity to compress the volume of both the reduce-scatter for gradient synchronization and the all-gather for reconstructing the model.

We implement SCAPE using Megatron-LM \[[17](https://arxiv.org/html/2607.01678#bib.bib10)\] and empirically verify its convergence by pre-training GPT-345M on the OpenWebText \[[7](https://arxiv.org/html/2607.01678#bib.bib6)\] dataset and Llama-500M on the SlimPajama-6B \[[27](https://arxiv.org/html/2607.01678#bib.bib18)\] dataset. We run experiments on the TACC Vista supercomputer and evaluate on an extensive suite of downstream tasks. Our experimental results show that SCAPE with 90% and 99% sparsity reduces the wall-clock time for Llama-500M pretraining on 32 GH200 GPUs by 35.6% and 43.3%, respectively. More importantly, under such high sparsities, SCAPE maintains final training and validation losses close to those of dense AdamS and AdamW, without degrading performance on downstream benchmarks such as LAMBADA, SuperGLUE, PIQA, MMLU, and ARC.

SCAPE is expected to be effective across GPU clusters with various GPU and interconnect configurations, though the improvements it achieves may vary depending on the hardware\. SCAPE leverages NVLink\-C2C with 900 GB/s throughput between the CPU and GPU on the GH200 superchips for offloading error feedback buffers\. Buffer offloading may introduce additional overhead on other GPU clusters with PCIe connections\. However, the impact of buffer offloading is minimal compared to the overall improvement with SCAPE\.

## II Background

In this section, we provide background on distributed training using Megatron-LM \[[17](https://arxiv.org/html/2607.01678#bib.bib10)\] with distributed optimizers, top-$k$ gradient sparsification, and the AdamS \[[39](https://arxiv.org/html/2607.01678#bib.bib7)\] optimizer.

### II-A Training with Sharded DP Distributed Optimizer

In conventional distributed data parallel (DDP) training, each worker (e.g., a GPU) in a data parallel (DP) group duplicates all the model parameters, gradients, and optimizer states, and processes a split of a micro-batch of training samples. After all workers finish computing their gradients on the local samples, they synchronize the gradients with a ring all-reduce operation before updating the optimizer states and model parameters. Assume there are $N$ workers and the gradient size is $W$ (same as the model size). This communication cost can be decomposed into a reduce-scatter followed by an all-gather operation, which is modeled as

$$
\begin{aligned}
T_{\text{ring-all-reduce}} &= T_{\text{reduce-scatter}} + T_{\text{all-gather}} \\
&= 2(N-1)\left[\alpha + \beta\frac{W}{N}\right],
\end{aligned}
\tag{1}
$$

where $\alpha$ is the fixed latency to initialize a message, and $\beta$ is the inverse of bandwidth. As the number of workers $N$ increases, the bandwidth term $2W\beta(N-1)/N$ approaches $2W\beta$, so it is dominated by the gradient size $W$ and the inverse of the bandwidth $\beta$.
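To make Equation 1 concrete, the following Python sketch evaluates the cost model for a few worker counts. The function name `ring_all_reduce_time` and all numeric values (latency, bandwidth, gradient size) are illustrative assumptions, not measurements from this work.

```python
def ring_all_reduce_time(num_workers: int, alpha: float, beta: float, size: float) -> float:
    """Equation 1: T = 2 (N - 1) [alpha + beta * W / N]."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    return 2 * (num_workers - 1) * (alpha + beta * size / num_workers)


if __name__ == "__main__":
    # Hypothetical values: 10 us latency, 100 GB/s bandwidth, 1 GB of gradients.
    alpha, beta, size = 10e-6, 1.0 / 100e9, 1e9
    for n in (2, 8, 32, 128):
        dense = ring_all_reduce_time(n, alpha, beta, size)
        # Shrinking W to 1% of its size models an idealized 99%-sparse payload.
        sparse = ring_all_reduce_time(n, alpha, beta, 0.01 * size)
        print(f"N={n:4d}  dense={dense * 1e3:.2f} ms  1% of W={sparse * 1e3:.2f} ms")
```

Under this model the latency term grows linearly with $N$ while the bandwidth term saturates, so reducing $W$ helps most when the bandwidth term dominates.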

Duplicating the whole model and optimizer states for all workers in a DP group guarantees that the forward and backward computation can finish without communication. However, the optimizer states usually require full precision (FP32) to guarantee training stability, which becomes a main memory bottleneck in training large language models (LLMs) with ever more parameters. Therefore, many frameworks have been proposed to reduce this bottleneck. For instance, Megatron-LM with sharded DP distributed optimizers partitions the optimizer states and the model parameters evenly among all DP workers. Thus, before the forward pass, workers need to perform an all-gather to collect the model shards from each other. Note that the model parameters can be prefetched to overlap the all-gather communication with computation. After the backward pass finishes, each DP worker has the complete gradient. Since the optimizer states have been partitioned evenly, the gradients’ all-reduce synchronization can be replaced by a reduce-scatter, which does not require gathering (i.e., it avoids the all-gather). This process is illustrated in [Figure 2](https://arxiv.org/html/2607.01678#S2.F2).

![Refer to caption](https://arxiv.org/html/2607.01678v1/x2.png) Figure 2: Megatron-LM with sharded data parallel distributed optimizer

### II-B Top-$k$ Gradient Sparsification

Top-$k$ sparsification is a technique for reducing $T_{\text{ring-all-reduce}}$ by reducing $W$ in [Equation 1](https://arxiv.org/html/2607.01678#S2.E1), given a fixed network bandwidth $1/\beta$. Specifically, before synchronizing the gradient $g$, we perform a top-$k$ operation to select the largest-magnitude gradient values, and then synchronize those top-$k$ values using a compressed buffer. This effectively reduces the communication volume from $W$ to $kW$, with $k$ expressed as the fraction of entries kept (e.g., $k = 0.1$ for top-10%).

Note that the sparsified $g$ is biased, since the top-$k$ operator drops all but the largest-magnitude entries in gradient $g$ and therefore does not preserve the original gradient in expectation. Consequently, directly applying top-$k$ sparsification at every iteration may introduce accumulated compression error and degrade convergence. To mitigate this issue, prior work usually employs error feedback \[[29](https://arxiv.org/html/2607.01678#bib.bib16)\], which accumulates the unsent residual from the current step and adds it back to the gradient before the next top-$k$ selection. Let $e_t$ denote the residual buffer at step $t$; then the communicated gradient is $\tilde{g}_t = \operatorname{TopK}(g_t + e_t)$, and the residual is updated as $e_{t+1} = g_t + e_t - \tilde{g}_t$. In this way, the information discarded in one iteration is not permanently lost, but is gradually re-injected into later communications, improving optimization stability while retaining the communication savings of sparsification. Nevertheless, this requires a dedicated error feedback buffer with the same size as $g$, usually kept in FP32 to preserve accuracy.
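The error-feedback recurrence above can be sketched in a few lines of plain Python. This is a minimal single-worker illustration on lists; the function name `top_k_with_error_feedback` and the tie-breaking rule are assumptions made for the example, not part of any cited method.

```python
from typing import List, Tuple


def top_k_with_error_feedback(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """One step of top-k sparsification with error feedback.

    Returns (sparse, new_residual) where
    sparse = TopK(g_t + e_t) and new_residual = g_t + e_t - sparse.
    """
    corrected = [g + e for g, e in zip(grad, residual)]
    # Indices sorted by decreasing magnitude (stable sort: ties keep lower index first).
    order = sorted(range(len(corrected)), key=lambda i: -abs(corrected[i]))
    keep = set(order[:k])
    sparse = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    new_residual = [c - s for c, s in zip(corrected, sparse)]
    return sparse, new_residual


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.5, -2.0, 0.1, 1.0], [0.6, 0.1, 0.0, 0.2]):
        sparse, residual = top_k_with_error_feedback(grad, residual, k=1)
        print("sent:", sparse, "residual:", residual)
```

In the second step of the usage example, the entry that was skipped at the first step is selected only because its accumulated residual pushes it over the threshold, which is the delayed re-injection described above.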

Since each worker computes its gradient $g_t$ at step $t$ from distinct data samples, the workers’ top-$k$ gradient indices are distinct as well. Thus, directly using all-reduce to average the sparsified gradients leads to wrong results. There are two approaches for tackling this issue. The first is to perform two all-gather collectives to collect both the top-$k$ mask and the corresponding values. The second is to first use all-gather to form a synchronized top-$k$ mask and then use all-reduce to synchronize the gradient entries selected by the globally synchronized mask. However, the second approach suffers from the fill-in effect. As described in \[[11](https://arxiv.org/html/2607.01678#bib.bib46)\], when the number of workers $N$ is large enough, the synchronized global top-$k$ mask is dense rather than sparse.
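The fill-in effect can be illustrated with a small simulation. The sketch below assumes each worker's gradient is drawn independently from a standard normal distribution, which is a deliberately pessimistic assumption (real gradients are correlated across workers, so fill-in is slower in practice); the helper names are hypothetical.

```python
import random
from typing import List, Set


def top_k_indices(values: List[float], k: int) -> Set[int]:
    """Indices of the k largest-magnitude entries."""
    order = sorted(range(len(values)), key=lambda i: -abs(values[i]))
    return set(order[:k])


def union_mask_density(num_workers: int, size: int, k: int, seed: int = 0) -> float:
    """Density of the union of per-worker top-k masks on independent random gradients."""
    rng = random.Random(seed)
    union: Set[int] = set()
    for _ in range(num_workers):
        grad = [rng.gauss(0.0, 1.0) for _ in range(size)]
        union |= top_k_indices(grad, k)
    return len(union) / size


if __name__ == "__main__":
    for n in (1, 2, 8, 32):
        print(n, "workers -> union density", union_mask_density(n, size=1000, k=100))
```

With a single worker the density equals the per-worker density; as workers are added, the union of their masks grows toward a dense mask even though each local mask stays sparse.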

### II-C AdamS Optimizer

AdamS \[[39](https://arxiv.org/html/2607.01678#bib.bib7)\] is a new Adam-like optimizer which, instead of keeping a second-moment $v_t$ as an exponential moving average (EMA) of $g_t^2$, uses the first-moment $m_t$ itself as the preconditioner:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t, \\
v_t &= \beta_2 {\color[rgb]{1,0,0}{m_{t-1}^{\odot 2}}} + (1-\beta_2) g_t^{\odot 2}, \\
w_t &= w_{t-1} - \eta\left(\frac{m_t}{\sqrt{v_t}+\epsilon} + \lambda w_{t-1}\right),
\end{aligned}
\tag{2}
$$

where $\beta_1$ and $\beta_2$ are the coefficients for computing the running averages, $w_t$ is the model parameter at step $t$, $\eta$ is the learning rate, $\lambda$ is the weight decay hyper-parameter, and $\epsilon$ maintains numerical stability. Note that the update rule in [Equation 2](https://arxiv.org/html/2607.01678#S2.E2) is simplified for demonstration purposes; in practice, $m_t$ and $v_t$ should be multiplied by their bias correction terms $1/(1-\beta_1^t)$ and $1/(1-\beta_2^t)$.
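As an illustration of Equation 2, here is a minimal element-wise sketch of one simplified AdamS step in plain Python, without bias correction. The function name `adams_step` and the default hyperparameter values are assumptions chosen for the example, not settings reported in this work.

```python
import math
from typing import List, Tuple


def adams_step(
    w: List[float], m_prev: List[float], g: List[float],
    lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.95,
    eps: float = 1e-8, weight_decay: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """One simplified AdamS step (Equation 2, no bias correction).

    Returns (w_new, m_new). No second-moment state is carried between steps:
    v_t is rebuilt from m_{t-1} and g_t every time.
    """
    w_new, m_new = [], []
    for wi, mi, gi in zip(w, m_prev, g):
        m_t = beta1 * mi + (1.0 - beta1) * gi
        v_t = beta2 * mi * mi + (1.0 - beta2) * gi * gi
        update = m_t / (math.sqrt(v_t) + eps) + weight_decay * wi
        w_new.append(wi - lr * update)
        m_new.append(m_t)
    return w_new, m_new


if __name__ == "__main__":
    w, m = [1.0, -1.0], [0.0, 0.0]
    for grad in ([0.2, -0.1], [0.3, 0.0]):
        w, m = adams_step(w, m, grad, weight_decay=0.1)
        print("w:", w, "m:", m)
```

The only state returned besides the parameters is the first-moment, which reflects the memory saving discussed in this section.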

Compared to AdamW \[[13](https://arxiv.org/html/2607.01678#bib.bib2)\] (i.e., Adam \[[10](https://arxiv.org/html/2607.01678#bib.bib8)\] with weight decay), AdamS has superior performance in terms of training stability, model quality, and memory consumption. First, when using the same set of hyperparameters, AdamS is more robust to noise than AdamW and avoids large spikes in its training loss curve. Since $m_t$ is the EMA of the gradient $g_t$, it is a smoother preconditioner than AdamW’s $v_t$, which is an EMA of the squared gradient $g_t^2$. When a large gradient occurs in a stable region where gradient norms are usually small, AdamS is more robust to such noise. Second, AdamS achieves model quality comparable to, and in some cases slightly better than, AdamW. In GPT-2 \[[21](https://arxiv.org/html/2607.01678#bib.bib13)\] pretraining, AdamS closely mirrors AdamW across model scales and even attains slightly lower validation perplexity. In post-training with GRPO \[[26](https://arxiv.org/html/2607.01678#bib.bib9)\] on the Countdown task, AdamS yields score curves that closely align with those of AdamW and occasionally surpass its validation performance. Finally, AdamS eliminates the need to store the second-moment, reducing optimizer state memory by 50% and leading to practical system benefits such as lower memory consumption in distributed training (e.g., FSDP \[[40](https://arxiv.org/html/2607.01678#bib.bib12)\]) and higher throughput in memory-bound large-scale pretraining.

## III Observations

In this section, we present our observations when pre\-training the GPT\-345M model\. Based on these observations, we provide insights that form the foundations for our system design in[Section IV](https://arxiv.org/html/2607.01678#S4)\.

### III-A Gradient Distribution

We first pre-train GPT-345M with the AdamW optimizer on the OpenWebText \[[7](https://arxiv.org/html/2607.01678#bib.bib6)\] dataset with top-10% sparsification of the gradient, meaning that we only synchronize the gradient entries with the top-10% largest magnitudes. In this setup, we switch from a dense all-reduce with no gradient compression (baseline method) to top-10% sparsification at step 10,000. Moreover, the top-$k$ sparsification is performed per layer. We can see from [Figure 3(a)](https://arxiv.org/html/2607.01678#S3.F3.sf1) that the difference between top-10% sparsification and the baseline method rapidly increases and then slowly decreases. We then try AdamS using the same setup, and the result is presented in [Figure 3(b)](https://arxiv.org/html/2607.01678#S3.F3.sf2). We notice that, using AdamS, the gap between the baseline method and the sparsified gradient is much smaller.

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/loss_topk_residual_norm/AdamW_grad_loss_topk_residual_norm.png) (a) AdamW
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/loss_topk_residual_norm/AdamS_grad_loss_topk_residual_norm.png) (b) AdamS

Figure 3: Comparison between AdamW and AdamS after switching from dense all-reduce to top-10% sparsified gradient at step 10,000 of pre-training GPT-345M

To analyze these phenomena, we plot the norms of the top-$k$ sparsified gradient and the residual buffer, which stores the non-top-$k$ gradient accumulated during training. From the curves of the top-$k$ gradient norm and the residual norm plotted in [Figure 3(a)](https://arxiv.org/html/2607.01678#S3.F3.sf1) and [Figure 3(b)](https://arxiv.org/html/2607.01678#S3.F3.sf2), we see that AdamW’s residual norm is much higher than AdamS’s. With a high sparsity rate, such as keeping only the top 10% in this experiment, the staleness effect of error feedback can significantly influence AdamW’s training process. The reason is that the chance of being selected as top-$k$ is so low that many gradient values with very large magnitudes (but still smaller than the top-10% threshold) have to wait until they have accumulated for several steps. When such large stale gradient values are added back to the optimizer states, including the first- and second-moment, they cause the optimizer states to drift from the baseline’s direction and thus lead to a gradually larger gap between the baseline and the top-$k$ method. This aligns well with the findings in \[[16](https://arxiv.org/html/2607.01678#bib.bib1)\] and \[[2](https://arxiv.org/html/2607.01678#bib.bib3)\] (i.e., the loss spike is triggered by suddenly having a large gradient when the training regime enters a region where the square of the gradient and the second-moment are both small).

However, for AdamS, the norm of the top-$k$ gradient and the norm of the residual values are small, so adding them back does not lead to a significant shift in the optimization direction. We plot the distribution of AdamW’s and AdamS’s gradients for different layers at two different steps in [Figure 4](https://arxiv.org/html/2607.01678#S3.F4) for the GPT-345M model. We also provide the gradient distribution for Llama-500M in [Figure 5](https://arxiv.org/html/2607.01678#S3.F5). From these results, we can see that AdamS’s gradient is more concentrated around 0, while AdamW’s gradient distribution spreads out more evenly and is flatter. This suggests that, compared to AdamW, AdamS is better suited to top-$k$ gradient sparsification.

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/gpt-345M/step_1k/layer_006_module.language_model.encoder.layers.0.self_attention.dense.weight_adam_vs_adamw.png) (a) Decoder 0 - Self attention dense at step 1,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/gpt-345M/step_1k/layer_130_module.language_model.encoder.layers.10.mlp.dense_h_to_4h.weight_adam_vs_adamw.png) (b) Decoder 10 - MLP $d_{\text{model}}$ to $4d_{\text{model}}$ at step 1,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/gpt-345M/step_50k/layer_006_module.language_model.encoder.layers.0.self_attention.dense.weight_adam_vs_adamw.png) (c) Decoder 0 - Self attention dense at step 50,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/gpt-345M/step_50k/layer_028_module.language_model.encoder.layers.2.self_attention.query_key_value.weight_adam_vs_adamw.png) (d) Decoder 2 - QKV projection weight at step 50,000

Figure 4: Gradient distribution of different layers in GPT-345M at different steps

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/llama-500M/step_1k/iter_0001000_step_1000_module_decoder_layers_0_mlp_linear_fc1_weight.csv.png) (a) Decoder 0 - gate and up projection in SwiGLU at step 1,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/llama-500M/step_1k/iter_0001000_step_1000_module_decoder_layers_8_self_attention_linear_proj_weight.csv.png) (b) Decoder 8 - Self attention dense at step 1,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/llama-500M/step_50k/iter_0050000_step_50000_module_decoder_layers_0_mlp_linear_fc1_weight.csv.png) (c) Decoder 0 - gate and up projection in SwiGLU at step 50,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/grad_distribution/llama-500M/step_50k/iter_0050000_step_50000_module_embedding_word_embeddings_weight.csv.png) (d) Word embedding at step 50,000

Figure 5: Gradient distribution of different layers in Llama-500M at different steps

### III-B Temporal Stability of First-Moment-Derived Top-$k$ Masks

Prior work has shown that sparse support patterns can exhibit temporal correlation across adjacent training steps. For example, the temporal stability of AdamW’s top-1% gradient has been studied in \[[41](https://arxiv.org/html/2607.01678#bib.bib4)\], and \[[36](https://arxiv.org/html/2607.01678#bib.bib5)\] reports strong gradient correlation during the early stage of LLM pre-training.

In our setting, we observe that this property also holds when top-$k$ masks are constructed from AdamS’s first-moment rather than raw gradients, for both GPT-345M and Llama-500M. In [Figure 6](https://arxiv.org/html/2607.01678#S3.F6), we show the overlap between adjacent-step top-10% masks selected according to the magnitude of AdamS’s first-moment across different layers and training stages. As shown in [Figure 6(a)](https://arxiv.org/html/2607.01678#S3.F6.sf1), many layers exhibit high mask overlap at the beginning of training. Although the overlap gradually decreases as training proceeds, it remains substantial throughout pre-training, indicating that the top-$k$ mask from step $t$ is a useful heuristic for step $t+1$. For Llama-500M, the overlap curves in [Figure 6(c)](https://arxiv.org/html/2607.01678#S3.F6.sf3) and [Figure 6(d)](https://arxiv.org/html/2607.01678#S3.F6.sf4) are noisier, but the average overlap for each layer remains high across training stages.
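One way to quantify the overlap between adjacent-step masks is sketched below. The exact overlap metric used for Figure 6 is not spelled out in the text, so the definition here (the fraction of the current top-$k$ mask that was already in the previous one) is an assumption, as are the function names.

```python
from typing import List, Set


def top_k_mask(values: List[float], density: float) -> Set[int]:
    """Indices of the top `density` fraction of entries by magnitude (at least one)."""
    k = max(1, int(len(values) * density))
    order = sorted(range(len(values)), key=lambda i: -abs(values[i]))
    return set(order[:k])


def mask_overlap(prev_values: List[float], curr_values: List[float], density: float) -> float:
    """Fraction of the current top-k mask that was already in the previous mask."""
    prev_mask = top_k_mask(prev_values, density)
    curr_mask = top_k_mask(curr_values, density)
    return len(prev_mask & curr_mask) / len(curr_mask)


if __name__ == "__main__":
    m_prev = [3.0, 2.0, 1.0, 0.0]   # stand-in for first-moment magnitudes at step t
    m_curr = [3.0, 0.0, 2.0, 1.0]   # stand-in for step t + 1
    print("overlap:", mask_overlap(m_prev, m_curr, density=0.5))
```

A value near 1 means that reusing the step-$t$ mask at step $t+1$ would select almost the same entries as a freshly computed mask.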

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/mask_overlap_rate/gpt-345m/step_1.png) (a) GPT-345M from step 1
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/mask_overlap_rate/gpt-345m/step_10000.png) (b) GPT-345M from step 10,000
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/mask_overlap_rate/llama2-500m/step_1.png) (c) Llama-500M from step 1
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/mask_overlap_rate/llama2-500m/step_10k.png) (d) Llama-500M from step 10,000

Figure 6: The overlap rate of AdamS’s first-moment-derived top-10% mask between two adjacent steps for different stages of pre-training GPT-345M and Llama-500M

## IV System Design

Motivated by the observations in [Section III](https://arxiv.org/html/2607.01678#S3), we design our system as shown in [Algorithm 1](https://arxiv.org/html/2607.01678#alg1), which is mainly composed of three components: 1) new mask synchronization, 2) top-$k$ mask computation, and 3) optimizer state update.

![Refer to caption](https://arxiv.org/html/2607.01678v1/x3.png) Figure 7: Communication in refreshing the top-$k$ mask. Each worker computes a sharded top-$k$ mask and then uses all-gather to construct the full mask. Since the usage of the top-$k$ mask is delayed by one step, the asynchronous all-gather can be hidden by the expensive backward computation.

**Algorithm 1** SCAPE

1. model parameters $\theta$, dataset $D$, loss function $f$
2. step $t$, worker rank $n$, total number of workers $N$
3. density $\rho$, weight decay $\lambda$, learning rate $\eta$
4. number of layers in model $L$
5. $M_0 \leftarrow \mathbf{1}$ ▷ Initialize sparse mask
6. $e_0^n \leftarrow \mathbf{0}$ ▷ Initialize EF buffer
7. **for** $t = 1, 2, \dots$ **do**
8. $b_t^n \leftarrow 0$ ▷ Re-initialize payload buffer
9. $g_t^n \leftarrow \nabla f(D_t^n, \theta_t)$
10. $\tilde{m}_t^n \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t^n + e_{t-1}^n$
11. $e_t^n[\neg M_{t-1}] \leftarrow \tilde{m}_t^n[\neg M_{t-1}]$
12. $e_t^n[M_{t-1}] \leftarrow 0$
13. $b_t^n[M_{t-1}] \leftarrow \tilde{m}_t^n[M_{t-1}]$
14. All-reduce $\bar{b}_t^n$: $\bar{b}_t \leftarrow \frac{1}{N}\sum_{n=1}^{N} b_t^n$
15. $M_t \leftarrow \text{refresh\_mask}(\rho, |\tilde{m}_t^n|)$ ▷ Check [Figure 7](https://arxiv.org/html/2607.01678#S4.F7)
16. $m_t \leftarrow \bar{b}_t$ ▷ First-moment’s non-top-$k$ entries are all 0
17. $\bar{b}_t[M_{t-1}] \leftarrow (\bar{b}_t[M_{t-1}] - \beta_1 m_{t-1}[M_{t-1}])/(1-\beta_1)$
18. clip\_grad($\bar{b}_t$)
19. $v_t \leftarrow \beta_2 m_{t-1}^{\odot 2} + (1-\beta_2)\bar{b}_t^{\odot 2}$
20. $u_t \leftarrow m_t/(\sqrt{v_t}+\epsilon)$ ▷ Compute update metrics
21. $\theta_{t+1} \leftarrow \theta_t - \eta(u_t + \lambda\theta_t)$
22. **end for**

### IV-A Refreshed Mask Synchronization

We propose a new mask refresh strategy, illustrated in [Figure 7](https://arxiv.org/html/2607.01678#S4.F7). This method differs from existing work in two important ways. First, instead of having all workers compute their own top-$k$ masks for all layers, each worker only computes a portion. This modification aligns with the design of Megatron-LM’s distributed optimizer, where the optimizer states are partitioned evenly across all workers and, at each step, each worker updates only its own portion. Moreover, when the distributed optimizer is not used, this design reduces each worker’s workload for the top-$k$ operations by a factor of $N$ (i.e., the total number of workers in a DP group). This design is also immune to the fill-in effect mentioned in \[[11](https://arxiv.org/html/2607.01678#bib.bib46)\], in which, as the number of workers grows, the gradient ‘sparsified’ by the top-$k$ indices collected from all workers becomes nearly dense. Since each worker in our design treats the top-$k$ masks for layers computed by its peers as all zeros, the fill-in effect is avoided. Second, the top-$k$ masks computed at step $t$ are used at step $t+1$. This delayed use of the top-$k$ masks provides an opportunity to hide the mask synchronization behind heavy computation, such as the forward and backward passes, using asynchronous communication.
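The partitioned mask refresh can be sketched in a single process as follows. This is not the Megatron-LM implementation: it assumes a flat parameter vector split into equal contiguous shards (the text partitions by layers and optimizer shards), replaces the all-gather with list concatenation, and uses hypothetical function names and signatures.

```python
from typing import List


def local_shard_mask(
    m_tilde: List[float], rank: int, num_workers: int, density: float
) -> List[bool]:
    """Top-k mask over this worker's contiguous shard of |m_tilde| only."""
    shard = len(m_tilde) // num_workers  # assumes the size divides evenly
    start = rank * shard
    k = max(1, int(shard * density))
    order = sorted(range(start, start + shard), key=lambda i: -abs(m_tilde[i]))
    keep = set(order[:k])
    return [i in keep for i in range(start, start + shard)]


def refresh_mask(per_worker_m_tilde: List[List[float]], density: float) -> List[bool]:
    """Simulated all-gather: concatenate each rank's shard mask into the full mask."""
    num_workers = len(per_worker_m_tilde)
    full: List[bool] = []
    for rank, m_tilde in enumerate(per_worker_m_tilde):
        full.extend(local_shard_mask(m_tilde, rank, num_workers, density))
    return full


if __name__ == "__main__":
    # Each inner list is one worker's local (unsynchronized) m_tilde.
    workers = [[4.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 5.0]]
    print(refresh_mask(workers, density=0.5))
```

Because every position is decided by exactly one worker, the density of the assembled mask stays at the target density regardless of the number of workers, which is why this layout does not exhibit fill-in.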

### IV-B Top-$k$ Mask Computation

The top-$k$ mask can be computed from either the gradient or a moment. SCAPE selects the first-moment to construct the top-$k$ mask. The reason is that, because the gradient contains noise inherited from the data, the top-$k$ indices generated from the current step’s gradient can have a low overlap rate with the true top-$k$ gradient indices in the next step. Therefore, we use the first-moment to generate the top-$k$ mask, as it maintains a running average of the gradient, which is more stable and has much lower noise. Specifically, we use $|\tilde{m}_t^n| = |\beta_1 m_{t-1} + (1-\beta_1) g_t^n + e_{t-1}^n|$ to construct the top-$k$ masks. Because $g_t^n$ and $e_{t-1}^n$ are not synchronized and contain local information, each worker’s top-$k$ mask guarantees that its local largest values in $|\tilde{m}_t^n|$ are immediately selected and then used to update the optimizer states and model parameters.

### IV-C Optimizer State Update

In [Algorithm 1](https://arxiv.org/html/2607.01678#alg1), SCAPE transmits a sparsified buffer $b_t^n$, which holds the values of the temporary, semi-updated first-moment buffer $\tilde{m}_t^n$ at the positions selected by the top-$k$ mask $M_{t-1}$. For DGC \[[12](https://arxiv.org/html/2607.01678#bib.bib14)\] and DeMo \[[19](https://arxiv.org/html/2607.01678#bib.bib19)\], transmitting the sparsified first moment is sufficient to update the optimizer state, because they use the SGD optimizer \[[23](https://arxiv.org/html/2607.01678#bib.bib20)\]. However, from [Equation 2](https://arxiv.org/html/2607.01678#S2.E2), updating the second moment requires both the synchronized first moment and the gradient. A naive implementation would perform two all-reduce operations to synchronize both. Nevertheless, we noticed that one all-reduce suffices. Given that $m_t$ is guaranteed to be synchronized, we can compute the globally averaged gradient and residual as

$$\bar{g}_t + \bar{e}_{t-1} = (\bar{b}_t - \beta_1 m_{t-1}) / (1 - \beta_1). \tag{3}$$

Then, we use $\bar{g}_t + \bar{e}_{t-1}$ as the synchronized top-$k$ quantity to update the second moment $v_t$ and the model parameters $w_t$.
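The single-collective reconstruction in Equation 3 can be sketched in a few lines of pure Python. This is an illustration under stated assumptions: `x_local` stands for each worker's combined gradient-plus-residual term, treated here as one quantity scaled by $(1-\beta_1)$; `all_reduce_mean` is a stand-in for an averaging all-reduce; and all function names are hypothetical rather than taken from SCAPE's code.

```python
def local_sparse_buffer(m_prev, x_local, mask, beta1):
    """Semi-updated first moment on masked positions, zero elsewhere."""
    return [beta1 * m + (1.0 - beta1) * x if keep else 0.0
            for m, x, keep in zip(m_prev, x_local, mask)]


def all_reduce_mean(buffers):
    """Stand-in for an averaging all-reduce across workers."""
    n = len(buffers)
    return [sum(vals) / n for vals in zip(*buffers)]


def reconstruct_avg_term(b_bar, m_prev, mask, beta1):
    """Apply Equation 3 on masked positions to recover the averaged term."""
    return [(b - beta1 * m) / (1.0 - beta1) if keep else 0.0
            for b, m, keep in zip(b_bar, m_prev, mask)]


beta1 = 0.9
m_prev = [0.5, -0.5]               # synchronized first moment from step t-1
mask = [1, 0]                      # only position 0 is transmitted
x_workers = [[1.0, 2.0], [3.0, 4.0]]  # toy local terms on two workers

b_bar = all_reduce_mean(
    [local_sparse_buffer(m_prev, x, mask, beta1) for x in x_workers]
)
recovered = reconstruct_avg_term(b_bar, m_prev, mask, beta1)
print(recovered)
```

On the transmitted position the recovered value matches the average of the two workers' local terms, so no second collective is needed to obtain the input for the second-moment update.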

### IV-D Optimizations for Distributed Optimizers

We can see from [Algorithm 1](https://arxiv.org/html/2607.01678#alg1) that, since $\bar{b}_t$ only has nonzero values at the indices selected by the top-$k$ mask $M_{t-1}$, we have

$$m_t[\neg M_{t-1}] = \bar{b}_t[\neg M_{t-1}] = 0. \tag{4}$$

Therefore, when computing the update $u_t$, we have

$$u_t[\neg M_{t-1}] = m_t[\neg M_{t-1}] / (\sqrt{v_t[\neg M_{t-1}]} + \epsilon) = 0. \tag{5}$$

Hence, $\theta_{t+1}[\neg M_{t-1}]$ is updated as

$$\begin{aligned}
\theta_{t+1}[\neg M_{t-1}] &= \theta_t[\neg M_{t-1}] - \eta(u_t[\neg M_{t-1}] + \lambda\theta_t[\neg M_{t-1}]) \\
&= \theta_t[\neg M_{t-1}] - \eta(0 + \lambda\theta_t[\neg M_{t-1}]) \\
&= (1 - \eta\lambda)\theta_t[\neg M_{t-1}].
\end{aligned} \tag{6}$$

This suggests that if the model parameters $\theta$ are partitioned among all workers, we can avoid the expensive all-gather operation for $\theta_{t+1}[\neg M_{t-1}]$ by computing it from each worker's local FP32 copy of the complete model (see [Figure 2](https://arxiv.org/html/2607.01678#S2.F2)). To reduce the memory overhead of keeping full model parameters on every worker, we offload the full-parameter replica to CPU memory. During parameter updates, we use a double-buffered pipeline that overlaps asynchronous CPU-to-GPU prefetch and sparse payload communication with local non-top-$k$ weight-decay updates, then writes back the updated buffer and offloads it to host memory for the next iteration.
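The consequence of Equations 4 to 6 can be checked with a small pure-Python sketch: off the mask, the general update collapses to a pure weight-decay scaling that needs no communicated data. The function names are hypothetical, bias correction is omitted for brevity, and `eps` is an assumed default; this is not SCAPE's implementation.

```python
def dense_step(theta, m, v, lr, weight_decay, eps=1e-8):
    """General update: theta - lr * (m / (sqrt(v) + eps) + weight_decay * theta)."""
    return [th - lr * (mi / (vi ** 0.5 + eps) + weight_decay * th)
            for th, mi, vi in zip(theta, m, v)]


def masked_step(theta, m, v, mask, lr, weight_decay, eps=1e-8):
    """Same update, but off-mask entries use only local weight decay."""
    new_theta = []
    for th, mi, vi, keep in zip(theta, m, v, mask):
        if keep:
            u = mi / (vi ** 0.5 + eps)
            new_theta.append(th - lr * (u + weight_decay * th))
        else:
            # m is zero off the mask, so u = 0 and only weight decay remains.
            new_theta.append((1.0 - lr * weight_decay) * th)
    return new_theta


theta = [1.0, 2.0, -3.0]
m = [0.2, 0.0, 0.0]        # first moment is zero wherever the mask is zero
v = [0.04, 0.09, 0.01]
mask = [1, 0, 0]

dense = dense_step(theta, m, v, lr=0.1, weight_decay=0.1)
sparse = masked_step(theta, m, v, mask, lr=0.1, weight_decay=0.1)
print(dense, sparse)
```

Both functions agree up to floating-point rounding whenever the first moment is zero off the mask, which is why the off-mask parameters can be refreshed from a local full copy instead of being gathered.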

Algorithm 2: Optimize Megatron-LM distributed optimizer with SCAPE

1: model parameter full FP32 copy $w$, parameter for forward and backward computation $\hat{w}$, parameter for communication $\tilde{w}$, local parameter $w'$ updated by partitioned optimizer states, top-$k$ mask $M$, learning rate $\eta$, weight decay $\lambda$, worker rank $n$, total number of workers $N$, step $t$, dataset $D$

2: function ForwardPass

3: $\tilde{w}_t^n \leftarrow w'_t[M_{t-1}]$ $\triangleright$ Compress comm volume

4: $\hat{w}_t[M_{t-1}] \leftarrow \text{AllGather}(\tilde{w}_t^n)$

5: $\hat{w}_t[\neg M_{t-1}] \leftarrow (1 - \eta\lambda)\hat{w}_t[\neg M_{t-1}]$ $\triangleright$ Use FP32 $w_t$ to compute untransmitted params

6: ForwardPass($D_t^n$, $\hat{w}_t$)

7: end function

8: function BackwardPass

9: Start AsyncMaskRefresh($M_t$) $\triangleright$ Hide mask sync with BW computation

10: $g_t^n \leftarrow$ BackwardPass($D_t^n$, $\hat{w}_t$)

11: end function

12: ReduceScatter($g_t^n$)

13: $b_t^n \leftarrow$ ComputeUpdatedMoments($g_t^n$)

14: $\bar{b}_t \leftarrow$ AllReduce($b_t^n[M_{t-1}]$)

15: UpdateLocalOptimizerStatesAndParams($\bar{b}_t$)

[Algorithm 2](https://arxiv.org/html/2607.01678#alg2) shows the optimized Megatron-LM with a distributed optimizer. If the model parameters are partitioned and need to be gathered across all workers before the forward and backward passes, we can reduce such communication volume by the inverse of the sparsity rate $1/k$. Additionally, to hide the communication cost of synchronizing and updating $M_t$, we launch the asynchronous all-gather for $M_t$ after the backward pass's all-gather finishes, guaranteeing no communication interference. After the backward pass finishes, when workers synchronize their local temporary, semi-updated first moment $b_t^n$, we replace the reduce-scatter with an all-reduce operation, because the LayerNorm, RMSNorm, and bias layers are usually not sparsified as they are less stable than the weights, which leads to uneven buffers for reduce-scatter across workers.

### IV-E Optimization for Memory Consumption

SCAPE holds one additional residual buffer for error feedback, which has the same size as the model and uses FP32 for accuracy. It also needs two buffers for the masks of the current and next steps. Thus, the total memory overhead can be expressed as $\text{Mem}_{\text{overhead}} = \text{Mem}_{\text{cur\_mask}} + \text{Mem}_{\text{next\_mask}} + \text{Mem}_{\text{residual}} = W + W + 4W = 6W$ bytes, where $W$ is the number of parameters, and the dtypes for the top-$k$ masks and the residual are INT8 and FP32, respectively.

To address the additional memory bottleneck, we use CPU offloading with double buffering for the residual and the full parameters, minimizing the impact on both per-step time and memory. Furthermore, each value in a mask is a single bit, and we pack eight of them into one byte. Note that the smallest dtype supported by PyTorch operations is INT8, so this single-bit format is not natively supported. Therefore, we store both masks in packed single-bit format and use double buffers to convert them when needed. We also use Triton kernels for the packing and unpacking operations to reduce their computation overhead. After applying these memory optimization techniques, the memory overhead is $2 \times W/8 + 2P + 8P = W/4 + 10P$ bytes, where $P$ is the number of parameters in the largest layer, $2P$ is the memory used by the double buffers for masks in INT8, and $8P$ is the memory used by the double buffers for residual offloading.
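The two overhead formulas and the bit-packing idea can be illustrated with a short pure-Python sketch. The function names, the least-significant-bit-first packing order, and the example sizes are assumptions made for illustration; SCAPE itself uses Triton kernels on GPU tensors, which this sketch does not attempt to reproduce.

```python
import math


def naive_overhead_bytes(num_params):
    """Two INT8 masks (1 byte per entry each) plus one FP32 residual: 6W bytes."""
    return num_params + num_params + 4 * num_params


def optimized_overhead_bytes(num_params, largest_layer_params):
    """Two bit-packed masks plus per-layer staging buffers: W/4 + 10P bytes."""
    packed_masks = 2 * math.ceil(num_params / 8)   # rounded up to whole bytes
    mask_staging = 2 * largest_layer_params        # double buffer, INT8
    residual_staging = 8 * largest_layer_params    # double buffer, FP32
    return packed_masks + mask_staging + residual_staging


def pack_mask(bits):
    """Pack a list of 0/1 values into bytes, eight mask entries per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_mask(packed, length):
    """Recover the first `length` 0/1 values from a packed mask."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(length)]


bits = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
packed = pack_mask(bits)
restored = unpack_mask(packed, len(bits))
naive = naive_overhead_bytes(1_000_000)
optimized = optimized_overhead_bytes(1_000_000, 100_000)
print(len(packed), naive, optimized)
```

With these made-up sizes (one million parameters, a largest layer of one hundred thousand), the optimized layout needs roughly a fifth of the naive overhead, and the saving grows as $P$ shrinks relative to $W$.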

## V Experiment Results

In this section, we evaluate SCAPE from three complementary perspectives\. We first present the pre\-training, downstream, and end\-to\-end wall\-clock results for Llama\-500M, our primary modern LLM workload for demonstrating practical systems benefit\. We then report GPT\-345M results to show that SCAPE generalizes beyond a single architecture family and training configuration\. Finally, we analyze per\-step time and strong\-scaling efficiency for Llama\-500M and Llama\-1\.8B under Megatron\-LM’s distributed optimizer, including the SCAPE\-specific optimizations, from 4 to 64 GPUs\.

### V-A Experiment Setup

We evaluate SCAPE by pre-training GPT-345M and Llama-500M on 32 NVIDIA GH200 GPUs of the Vista supercomputer \[[24](https://arxiv.org/html/2607.01678#bib.bib17)\] at the Texas Advanced Computing Center (TACC). Each Vista node uses the Grace-Hopper architecture with one GH200 GPU, 96 GB of HBM3 memory, and an NVLink-C2C interconnect between the Grace CPU and Hopper GPU. The nodes are connected through a 400 Gbps NVIDIA NDR InfiniBand network. To ensure a consistent and reproducible software environment, all experiments were conducted using the NVIDIA NGC PyTorch container (v26.01).

We evaluate the pre-trained models on zero-shot downstream benchmarks using lm-evaluation-harness \[[6](https://arxiv.org/html/2607.01678#bib.bib48)\], including ARC (Easy and Challenge) \[[5](https://arxiv.org/html/2607.01678#bib.bib31)\], LAMBADA \[[18](https://arxiv.org/html/2607.01678#bib.bib24)\], HellaSwag \[[38](https://arxiv.org/html/2607.01678#bib.bib26)\], MMLU \[[8](https://arxiv.org/html/2607.01678#bib.bib29)\], PIQA \[[3](https://arxiv.org/html/2607.01678#bib.bib30)\], WinoGrande \[[25](https://arxiv.org/html/2607.01678#bib.bib28)\], OpenBookQA \[[15](https://arxiv.org/html/2607.01678#bib.bib33)\], and SuperGLUE \[[33](https://arxiv.org/html/2607.01678#bib.bib21)\].

### V-B Llama-500M Pre-training

#### V-B1 Model Architecture and Hyperparameters

We use the model architecture of H2O-Danube3-500M \[[20](https://arxiv.org/html/2607.01678#bib.bib22)\] to define our Llama-500M model and pre-train it on SlimPajama-6B, a subset sampled from SlimPajama \[[27](https://arxiv.org/html/2607.01678#bib.bib18)\]. We use the same tokenizer as Llama-7B \[[31](https://arxiv.org/html/2607.01678#bib.bib23)\] from HuggingFace. We pre-train Llama-500M for 100,000 steps with a global batch size of 1024 and a sequence length of 4096. The optimizer hyperparameters are: peak learning rate $\eta = 3 \times 10^{-4}$, minimum learning rate $\eta_{\text{min}} = 3 \times 10^{-5}$, cosine learning rate decay, learning rate warmup for 2,000 steps, $(\beta_1, \beta_2) = (0.9, 0.95)$, RMSNorm $\epsilon = 10^{-5}$, weight decay $\lambda = 0.1$, and gradient clipping at 1. We use BF16 for parameters and FP32 for the gradient. For SCAPE, we use sparsity warmup to exponentially decrease the density $d$ from 1 to 0.1 or 0.01, and we do not compress RMSNorm layers.

TABLE I: Final training and validation loss and end-to-end (E2E) wall-clock time of pre-training Llama-500M

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/experiment_results/llama/train_loss.png) (a) Training loss
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/experiment_results/llama/validation_loss.png) (b) Validation loss

Figure 8: Pre-training loss curves for Llama-500M

TABLE II: Zero-shot evaluation results for Llama-500M pre-trained using different approaches

#### V-B2 Pre-training Results

The pre-training loss curves for Llama-500M using AdamW, AdamS, SCAPE ($d$ = 0.1), and SCAPE ($d$ = 0.01) are shown in [Figure 8](https://arxiv.org/html/2607.01678#S5.F8). Their final end-to-end (E2E) wall-clock times, training losses, and validation losses are summarized in [Table I](https://arxiv.org/html/2607.01678#S5.T1). Since AdamW is still the de facto optimizer for pre-training LLMs, we include it as a reference baseline. Compared with dense AdamS, SCAPE ($d$ = 0.1) reduces wall-clock time from 2.47 to 1.59 days (1.55$\times$ speedup, 35.6% reduction) and lowers validation loss from 2.32 to 2.30. SCAPE ($d$ = 0.01) further reduces the time to 1.40 days (1.76$\times$ speedup, 43.3% reduction) with a validation loss of 2.31, which is still below that of AdamS. These results show that SCAPE is a quality-preserving, communication-efficient method for large-scale pre-training: it delivers substantial E2E wall-clock speedups while keeping model quality comparable to dense AdamS, with a slightly lower validation loss.
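The speedup and reduction figures quoted above follow directly from the reported wall-clock times. The short check below recomputes them; the helper names are ours, and only the three wall-clock values (in days) come from the text.

```python
def speedup(baseline_days, method_days):
    """How many times faster the method is than the baseline."""
    return baseline_days / method_days


def reduction_pct(baseline_days, method_days):
    """Percentage of baseline wall-clock time saved by the method."""
    return 100.0 * (baseline_days - method_days) / baseline_days


adams_days = 2.47          # dense AdamS
scape_d01_days = 1.59      # SCAPE, d = 0.1
scape_d001_days = 1.40     # SCAPE, d = 0.01

results = {
    "d=0.1": (round(speedup(adams_days, scape_d01_days), 2),
              round(reduction_pct(adams_days, scape_d01_days), 1)),
    "d=0.01": (round(speedup(adams_days, scape_d001_days), 2),
               round(reduction_pct(adams_days, scape_d001_days), 1)),
}
print(results)
```

Note that speedup and reduction are two views of the same ratio: a speedup of $s$ corresponds to a reduction of $1 - 1/s$.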

#### V-B3 Downstream Task Evaluation

[Table II](https://arxiv.org/html/2607.01678#S5.T2) shows that SCAPE largely preserves downstream task performance for the pre-trained Llama-500M despite using aggressive communication sparsity. Specifically, SCAPE ($d$ = 0.1) outperforms dense AdamW on 5 of 16 tasks and dense AdamS on 9 of 16 tasks. In particular, SCAPE ($d$ = 0.1) surpasses both dense baselines on RTE and ReCoRD, while also improving over AdamW on ARC-E and OpenBookQA, and over AdamS on ARC-C, ARC-E, LAMBADA, OpenBookQA, PIQA, WSC, and MultiRC. Even at $d$ = 0.01, SCAPE remains competitive, exceeding AdamW on 4 of 16 tasks and AdamS on 7 of 16 tasks, including 3 SuperGLUE improvements over AdamW and 4 SuperGLUE improvements over AdamS. These results indicate that SCAPE retains strong downstream generalization for Llama-500M under 90% and 99% sparsity, with $d$ = 0.1 offering the best balance between compression and quality preservation.

### V-C GPT-345M Pre-training

#### V-C1 Model Architecture and Hyperparameters

We pre-train the GPT-345M model using the model architecture definition described in \[[21](https://arxiv.org/html/2607.01678#bib.bib13)\] on the OpenWebText dataset \[[7](https://arxiv.org/html/2607.01678#bib.bib6)\] for 100,000 steps. We set the global batch size to 512 and the sequence length to 1024. For AdamW, AdamS, and SCAPE, we set $(\beta_1, \beta_2) = (0.9, 0.999)$, the peak learning rate $\eta = 1.5 \times 10^{-4}$, the minimum learning rate $\eta_{\text{min}} = 10^{-5}$, the learning rate scheduler to cosine, the learning rate warmup steps to 5,000 (5% of the pre-training steps), the weight decay $\lambda = 0.01$, and the gradient clip to 1. As in Llama-500M training, mixed-precision training is used: the model parameters are in BF16, and the gradient is accumulated in FP32. Likewise, we use density warmup to gradually decrease $d$ from 1 to 0.1 or 0.01. The LayerNorm and bias layers in GPT-345M are not compressed.

![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/experiment_results/gpt/train_loss.png) (a) Training loss
![Refer to caption](https://arxiv.org/html/2607.01678v1/figures/experiment_results/gpt/validation_loss.png) (b) Validation loss

Figure 9: Pre-training loss curves for GPT-345M

TABLE III: Final training and validation loss of pre-training GPT-345M

TABLE IV: Zero-shot evaluation results for GPT-345M pre-trained using different methods

#### V-C2 Pre-training Results

The training and validation loss curves for pre-training the GPT-345M model with AdamW, AdamS, SCAPE ($d$ = 0.1), and SCAPE ($d$ = 0.01) are shown in [Figure 9](https://arxiv.org/html/2607.01678#S5.F9), and the final training and validation losses are listed in [Table III](https://arxiv.org/html/2607.01678#S5.T3).

Surprisingly, given the same token budget, when using $d = 0.1$ (90% sparsity), SCAPE achieves lower training and validation loss than the dense AdamS. When the sparsity rate is increased to 99% (i.e., $d = 0.01$), the final training and validation losses of SCAPE differ only slightly from those of AdamS, with differences below 0.04. Additionally, the difference between AdamW and SCAPE with $d = 0.01$ is so small that it can be considered negligible.

#### V-C3 Downstream Task Evaluation

[Table IV](https://arxiv.org/html/2607.01678#S5.T4) presents the zero-shot evaluation results of GPT-345M on the same downstream benchmark suite used for Llama-500M. Using dense AdamW as the reference baseline, SCAPE achieves strong task-level gains under aggressive sparsity. In particular, SCAPE ($d$ = 0.1) improves accuracy on 9 of 16 tasks, including ARC-C, HellaSwag, LAMBADA, MMLU, OpenBookQA, BoolQ, WiC, CB, and ReCoRD. Even when the sparsity is increased to 99% ($d$ = 0.01), SCAPE still outperforms AdamW on 8 of 16 tasks. These results indicate that GPT-345M is highly tolerant to SCAPE's sparse synchronization, especially at $d$ = 0.1, where sparse training delivers improvements on a majority of the reported zero-shot evaluations while maintaining competitive overall downstream performance.

![Refer to caption](https://arxiv.org/html/2607.01678v1/x4.png) (a) Llama-500M
![Refer to caption](https://arxiv.org/html/2607.01678v1/x5.png) (b) Llama-1.8B

Figure 10: Per-step time comparison between different methods used for training Llama-500M (sequence length 4K) and Llama-1.8B (sequence length 2K)

### V-D Per-step Time Analysis

We profile the time per iteration for pre-training Llama-500M (global batch size 1024 and sequence length 4K) and Llama-1.8B (global batch size 1024 and sequence length 2K) with eight configurations. These configurations include AdamS, AdamS with distributed optimizer, SCAPE ($d$ = 0.1), SCAPE ($d$ = 0.1) with distributed optimizer, SCAPE ($d$ = 0.1) with distributed optimizer and CPU offloading, SCAPE ($d$ = 0.01), SCAPE ($d$ = 0.01) with distributed optimizer, and SCAPE ($d$ = 0.01) with distributed optimizer and CPU offloading, where we define the model architecture of Llama-1.8B by following the definition of H2O-Danube3-1.8B \[[20](https://arxiv.org/html/2607.01678#bib.bib22)\]. The profiling results ranging from 4 to 64 GH200 GPUs are plotted in [Figure 10](https://arxiv.org/html/2607.01678#S5.F10). Note that SCAPE ($d$ = 0.1) and SCAPE ($d$ = 0.01) refer to the DDP setting, in which each worker holds the fully replicated model and optimizer states in GPU memory.

As the profiling results show, SCAPE reduces the time per iteration for pre-training both Llama-500M and Llama-1.8B under different training configurations. For Llama-500M on 64 GPUs, SCAPE with $d$ = 0.1 and $d$ = 0.01 reduces the per-step time from 1753.66 ms under dense AdamS to 1077.40 ms and 908.19 ms, corresponding to 1.63$\times$ and 1.93$\times$ speedup, respectively. For Llama-1.8B, SCAPE reduces the per-step time from dense AdamS's 4804.44 ms to 1983.08 ms and 1473.45 ms, yielding 2.42$\times$ and 3.26$\times$ speedup, respectively. The per-step time reduction is thus less pronounced for Llama-500M than for Llama-1.8B. The reason is that, with such a high number of tokens per step and a smaller model size than Llama-1.8B, the per-step training time for Llama-500M is dominated by computation rather than communication, as can be seen in [Figure 1](https://arxiv.org/html/2607.01678#S1.F1).

We notice that when the distributed optimizer is used, AdamS's per-step time is reduced, while SCAPE's is slightly increased for Llama-500M. The reason is that sharding the model and optimizer states across all workers for AdamS reduces each worker's computation time for updating the model and states. Moreover, the combination of reduce-scatter for synchronizing the gradient and all-gather for gathering the model shards in the distributed optimizer setting has the same cost as $T_{\text{ring-all-reduce}}$ (see [Equation 1](https://arxiv.org/html/2607.01678#S2.E1)). Therefore, using the distributed optimizer reduces the computation for each worker without introducing additional communication overhead. However, for SCAPE with the distributed optimizer, since each worker still needs to update the full model, the computation is not reduced. Furthermore, the use of all-reduce introduces additional communication overhead compared to reduce-scatter (see [Equation 1](https://arxiv.org/html/2607.01678#S2.E1)), although its communication volume is significantly reduced. Hence, SCAPE with the distributed optimizer has a slightly higher per-step time than SCAPE alone when communication does not dominate the per-step time, such as when training Llama-500M on four GPUs.

[Figure 10](https://arxiv.org/html/2607.01678#S5.F10) shows that adding CPU offloading to SCAPE with the distributed optimizer adds little per-step overhead for transfers between CPU and GPU. Thanks to the 900 GB/s bandwidth of GH200's NVLink-C2C, our double-buffering scheme for CPU offloading has only a minimal impact on the per-step time. Nevertheless, it could add extra overhead on systems that connect the CPU and GPU through PCIe, which has lower bandwidth than NVLink-C2C.

![Refer to caption](https://arxiv.org/html/2607.01678v1/x6.png) (a) Llama-500M
![Refer to caption](https://arxiv.org/html/2607.01678v1/x7.png) (b) Llama-1.8B

Figure 11: Strong scaling efficiency for training Llama-500M (sequence length 4K) and Llama-1.8B (sequence length 2K) with different methods

### V-E Strong Scaling Efficiency

The strong scaling efficiency results for training Llama-500M and Llama-1.8B using all aforementioned methods on 4, 8, 16, 32, and 64 GH200 GPUs are plotted in [Figure 11](https://arxiv.org/html/2607.01678#S5.F11). Using the four-GPU setup as the baseline, the strong scaling efficiency improves with SCAPE for both models. For Llama-500M at 64 GPUs, efficiency improves from AdamS's 30.98% to 46.11% for SCAPE ($d$ = 0.1) and 54.56% for SCAPE ($d$ = 0.01), and up to 63.69% when using SCAPE ($d$ = 0.01) with distributed optimizer and CPU offloading. For Llama-1.8B on 64 GPUs, where communication dominates the per-step time (see [Figure 1](https://arxiv.org/html/2607.01678#S1.F1)), the same pattern is more pronounced, improving from 17.58% for AdamS and 21.48% for AdamS with distributed optimizer to 37.15% for SCAPE ($d$ = 0.1), 47.83% for SCAPE ($d$ = 0.01), and 55.04% for SCAPE ($d$ = 0.01) with distributed optimizer and CPU offloading.
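For readers unfamiliar with the metric, the sketch below computes strong scaling efficiency under the conventional definition: with a fixed global batch, the ideal per-step time shrinks in proportion to the GPU count, and efficiency is the ratio of that ideal time to the measured one. The text does not spell out its exact formula, so this definition is an assumption, and the step times in the example are made up for illustration rather than taken from the measurements.

```python
def strong_scaling_efficiency(base_gpus, base_step_ms, gpus, step_ms):
    """Ratio of ideal to measured per-step time, relative to a baseline run.

    Ideal scaling assumes step time falls by base_gpus / gpus when the
    global batch size stays fixed.
    """
    ideal_step_ms = base_step_ms * base_gpus / gpus
    return ideal_step_ms / step_ms


# Hypothetical timings: 1000 ms per step on 4 GPUs, 125 ms per step on 64 GPUs.
baseline_eff = strong_scaling_efficiency(4, 1000.0, 4, 1000.0)
scaled_eff = strong_scaling_efficiency(4, 1000.0, 64, 125.0)
print(baseline_eff, scaled_eff)
```

In this made-up case the 64-GPU run is 8 times faster than the baseline on 16 times the hardware, so half of the ideal speedup is realized.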

![Refer to caption](https://arxiv.org/html/2607.01678v1/x8.png) (a) Llama-500M
![Refer to caption](https://arxiv.org/html/2607.01678v1/x9.png) (b) Llama-1.8B

Figure 12: Memory usage for training Llama-500M (sequence length 4K) and Llama-1.8B (sequence length 2K)

### V-F Memory Consumption

We profile the memory consumption for training Llama-500M and Llama-1.8B under different settings, and the results are shown in [Figure 12](https://arxiv.org/html/2607.01678#S5.F12). Since the memory overhead of the residual buffer and full model parameters is determined by the number of model parameters and does not depend on SCAPE's $d$ or the number of workers, only the memory usage for $d$ = 0.1 on 4 GPUs is reported. We can see that with residual offloading enabled, the memory usage for training Llama-1.8B is reduced from 68,290 MB to 61,930 MB and from 75,832 MB to 69,474 MB before and after distributed optimizer is enabled. For Llama-500M training, because its model size is small and most of the GPU memory is consumed by long-sequence activations, using residual offloading does not significantly reduce its memory usage.

## VI Related Work

Many research works have been proposed to achieve communication\-efficient pre\-training through compression on gradient and model weights\. Specifically, they can be classified into three main categories: gradient sparsification with error feedback, low\-rank approximation, and quantization\.

For gradient sparsification, DGC \[[12](https://arxiv.org/html/2607.01678#bib.bib14)\] proposes a momentum-based sparsification method to reduce the communication volume of synchronizing gradients in DDP. DeMo \[[19](https://arxiv.org/html/2607.01678#bib.bib19)\] proposes a decoupled momentum optimization method that first applies DCT to orthonormalize the momentum and then uses top-$k$ sparsification with error feedback to reduce communication volume. O$k$-Top$k$ \[[11](https://arxiv.org/html/2607.01678#bib.bib46)\] proposes a novel sparse all-reduce algorithm to achieve near asymptotically optimal communication volume. EDGC \[[37](https://arxiv.org/html/2607.01678#bib.bib36)\] proposes an entropy-driven adaptive gradient sparsification framework that dynamically sets the sparsity rate for each model layer. Radius \[[41](https://arxiv.org/html/2607.01678#bib.bib4)\] exploits the temporal stability of the indices of the top-$k$ gradient values to amortize the computation cost of top-$k$ operations, and thus achieves lower computation overhead in gradient sparsification and higher throughput.

For low-rank compression, ATOMO \[[35](https://arxiv.org/html/2607.01678#bib.bib38)\] proposes to use singular value decomposition (SVD) to express gradients as atomic components and transmits only a subset of them to reduce communication overhead. PowerSGD \[[32](https://arxiv.org/html/2607.01678#bib.bib37)\] proposes a low-rank approximation of the gradient by representing it with two much smaller factor matrices. Optimus-CC \[[28](https://arxiv.org/html/2607.01678#bib.bib39)\] proposes a novel framework that combines 3D parallel training with PowerSGD.

For quantization-based methods, QSDP \[[14](https://arxiv.org/html/2607.01678#bib.bib42)\] proposes to extend FSDP \[[40](https://arxiv.org/html/2607.01678#bib.bib12)\] with both model weight and gradient quantization to reduce the communication cost with guaranteed convergence. ZeRO++ \[[34](https://arxiv.org/html/2607.01678#bib.bib41)\] improves the communication efficiency of ZeRO \[[22](https://arxiv.org/html/2607.01678#bib.bib11)\] through the combination of block-wise quantized all-gather, communication-aware data remapping, and quantized gradient averaging built on all-to-all communication. SDP4Bit \[[9](https://arxiv.org/html/2607.01678#bib.bib40)\] proposes to use 4-bit quantization to reduce both weight and gradient communication by quantizing the weight differences and applying a two-level smooth quantization scheme to gradients, while also introducing runtime optimizations to mitigate quantization overhead.

## VII Conclusion and Future Works

We present SCAPE, a communication-efficient distributed optimizer for LLM pre-training implemented on top of Megatron-LM that achieves aggressive sparsification while preserving model quality on downstream evaluation tasks. Instead of sparsifying and transmitting the raw gradients, SCAPE leverages the stability of AdamS's first moment to construct top-$k$ masks for sparse communication, aligns the top-$k$ mask generation with optimizer sharding, and applies the top-$k$ masks with a one-step delay to overlap the mask synchronization with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, thereby avoiding an additional collective. Through extensive evaluations, we show that SCAPE preserves the training loss, the validation loss, and the downstream task scores under both 90% and 99% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3% relative to dense baselines while maintaining comparable model quality. For Llama-1.8B, SCAPE achieves 3.26$\times$ speedup on 64 GPUs compared to dense AdamS. In future work, we will explore extending SCAPE's moment-based sparsification to other optimizers, including Muon, which is gaining increasing popularity in production LLM training.

## References

- \[1\]\(2024\)Large language models for mathematical reasoning: progresses and challenges\.External Links:2402\.00157,[Link](https://arxiv.org/abs/2402.00157)Cited by:[§I](https://arxiv.org/html/2607.01678#S1.p1.1)\.
- \[2\]Z\. Bai, Z\. Zhou, J\. Zhao, X\. Li, Z\. Li, F\. Xiong, H\. Yang, Y\. Zhang, and Z\. J\. Xu\(2025\)Adaptive preconditioners trigger loss spikes in adam\.External Links:2506\.04805,[Link](https://arxiv.org/abs/2506.04805)Cited by:[§III\-A](https://arxiv.org/html/2607.01678#S3.SS1.p2.5)\.
- \[3\]Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi\(2019\)PIQA: Reasoning about Physical Commonsense in Natural Language\.External Links:1911\.11641,[Link](https://arxiv.org/abs/1911.11641)Cited by:[§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[4\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.External Links:2107\.03374,[Link](https://arxiv.org/abs/2107.03374)Cited by:[§I](https://arxiv.org/html/2607.01678#S1.p1.1)\.
- \[5\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge\.External Links:1803\.05457,[Link](https://arxiv.org/abs/1803.05457)Cited by:[§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[6\]L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou\(2024\-07\)The Language Model Evaluation Harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://doi.org/10.5281/zenodo.12608602)Cited by:[§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[7\]A\. Gokaslan, V\. Cohen, E\. Pavlick, and S\. Tellex\(2019\)OpenWebText Corpus\.Note:[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)External Links:[Document](https://dx.doi.org/10.5281/zenodo.3834942),[Link](https://doi.org/10.5281/zenodo.3834942)Cited by:[§I](https://arxiv.org/html/2607.01678#S1.p5.1),[§III\-A](https://arxiv.org/html/2607.01678#S3.SS1.p1.3),[§V\-C1](https://arxiv.org/html/2607.01678#S5.SS3.SSS1.p1.5)\.
- \[8\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring Massive Multitask Language Understanding\.External Links:2009\.03300,[Link](https://arxiv.org/abs/2009.03300)Cited by:[§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[9\] J\. Jia, C\. Xie, H\. Lu, D\. Wang, H\. Feng, C\. Zhang, B\. Sun, H\. Lin, Z\. Zhang, X\. Liu, and D\. Tao \(2024\) SDP4Bit: Toward 4\-bit Communication Quantization in Sharded Data Parallelism for LLM Training\. External Links: 2410\.15526, [Link](https://arxiv.org/abs/2410.15526) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§VI](https://arxiv.org/html/2607.01678#S6.p4.1)\.
- \[10\] D\. P\. Kingma and J\. Ba \(2017\) Adam: A Method for Stochastic Optimization\. External Links: 1412\.6980, [Link](https://arxiv.org/abs/1412.6980) Cited by: [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p2.4)\.
- \[11\] S\. Li and T\. Hoefler \(2022\-03\) Near\-optimal sparse allreduce for distributed deep learning\. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, External Links: [Document](https://dx.doi.org/10.1145/3503221.3508399), [Link](https://doi.org/10.1145/3503221.3508399) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§II\-B](https://arxiv.org/html/2607.01678#S2.SS2.p3.7), [§IV\-A](https://arxiv.org/html/2607.01678#S4.SS1.p1.9), [§VI](https://arxiv.org/html/2607.01678#S6.p2.5)\.
- \[12\] Y\. Lin, S\. Han, H\. Mao, Y\. Wang, and W\. J\. Dally \(2020\) Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training\. External Links: 1712\.01887, [Link](https://arxiv.org/abs/1712.01887) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§IV\-C](https://arxiv.org/html/2607.01678#S4.SS3.p1.5), [§VI](https://arxiv.org/html/2607.01678#S6.p2.5)\.
- \[13\] I\. Loshchilov and F\. Hutter \(2019\) Decoupled Weight Decay Regularization\. External Links: 1711\.05101, [Link](https://arxiv.org/abs/1711.05101) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p1.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p2.4)\.
- \[14\] I\. Markov, A\. Vladu, Q\. Guo, and D\. Alistarh \(2023\) Quantized Distributed Training of Large Models with Convergence Guarantees\. External Links: 2302\.02390, [Link](https://arxiv.org/abs/2302.02390) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§VI](https://arxiv.org/html/2607.01678#S6.p4.1)\.
- \[15\] T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\) Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 2381–2391\. External Links: [Document](https://dx.doi.org/10.18653/v1/D18-1260), [Link](https://doi.org/10.18653/v1/D18-1260) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[16\] I\. Molybog, P\. Albert, M\. Chen, Z\. DeVito, D\. Esiobu, N\. Goyal, P\. S\. Koura, S\. Narang, A\. Poulton, R\. Silva, B\. Tang, D\. Liskovich, P\. Xu, Y\. Zhang, M\. Kambadur, S\. Roller, and S\. Zhang \(2023\) A Theory on Adam Instability in Large\-Scale Machine Learning\. External Links: 2304\.09871, [Link](https://arxiv.org/abs/2304.09871) Cited by: [§III\-A](https://arxiv.org/html/2607.01678#S3.SS1.p2.5)\.
- \[17\] Megatron\-LM\. External Links: [Link](https://github.com/NVIDIA/Megatron-LM) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p1.1), [§I](https://arxiv.org/html/2607.01678#S1.p5.1), [§II](https://arxiv.org/html/2607.01678#S2.p1.1)\.
- \[18\] D\. Paperno, G\. Kruszewski, A\. Lazaridou, Q\. N\. Pham, R\. Bernardi, S\. Pezzelle, M\. Baroni, G\. Boleda, and R\. Fernández \(2016\) The LAMBADA dataset: Word prediction requiring a broad discourse context\. External Links: 1606\.06031, [Link](https://arxiv.org/abs/1606.06031) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[19\] B\. Peng, L\. Chen, B\. Su, J\. Quesnelle, D\. P\. Kingma, and Q\. Liu \(2026\) DeMo: Decoupled Momentum Optimization\. External Links: 2411\.19870, [Link](https://arxiv.org/abs/2411.19870) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§IV\-C](https://arxiv.org/html/2607.01678#S4.SS3.p1.5), [§VI](https://arxiv.org/html/2607.01678#S6.p2.5)\.
- \[20\] P\. Pfeiffer, P\. Singer, Y\. Babakhin, G\. Fodor, N\. Dhankhar, and S\. S\. Ambati \(2024\) H2O\-Danube3 Technical Report\. External Links: 2407\.09276, [Link](https://arxiv.org/abs/2407.09276) Cited by: [§V\-B1](https://arxiv.org/html/2607.01678#S5.SS2.SSS1.p1.6), [§V\-D](https://arxiv.org/html/2607.01678#S5.SS4.p1.8)\.
- \[21\] A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\) Language Models are Unsupervised Multitask Learners\. Cited by: [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p2.4), [§V\-C1](https://arxiv.org/html/2607.01678#S5.SS3.SSS1.p1.5)\.
- \[22\] S\. Rajbhandari, J\. Rasley, O\. Ruwase, and Y\. He \(2020\) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models\. External Links: 1910\.02054, [Link](https://arxiv.org/abs/1910.02054) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p1.1), [§VI](https://arxiv.org/html/2607.01678#S6.p4.1)\.
- \[23\] H\. Robbins and S\. Monro \(1951\) A Stochastic Approximation Method\. The Annals of Mathematical Statistics 22\(3\), pp\. 400–407\. External Links: [Document](https://dx.doi.org/10.1214/aoms/1177729586), [Link](https://doi.org/10.1214/aoms/1177729586) Cited by: [§IV\-C](https://arxiv.org/html/2607.01678#S4.SS3.p1.5)\.
- \[24\] A\. Ruhela, J\. Cazes, J\. D\. McCalpin, C\. Del\-Castillo\-Negrete, J\. Li, H\. Liu, H\. Chen, C\. Lu, K\. F\. Milfeld, W\. Zhang, I\. Wang, L\. Koesterke, J\. DeSantis, N\. Lewis, S\. Hempel, and D\. Stanzione \(2024\) Performance Analysis of Scientific Applications on an NVIDIA Grace System\. In SC24\-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp\. 558–566\. External Links: [Document](https://dx.doi.org/10.1109/SCW63240.2024.00078), [Link](https://doi.org/10.1109/SCW63240.2024.00078) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p1.1)\.
- \[25\] K\. Sakaguchi, R\. Le Bras, C\. Bhagavatula, and Y\. Choi \(2020\) WinoGrande: An Adversarial Winograd Schema Challenge at Scale\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 34, pp\. 8732–8740\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v34i05.6399), [Link](https://doi.org/10.1609/aaai.v34i05.6399) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[26\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\. External Links: 2402\.03300, [Link](https://arxiv.org/abs/2402.03300) Cited by: [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p2.4)\.
- \[27\] D\. Soboleva, F\. Al\-Khateeb, R\. Myers, J\. R\. Steeves, J\. Hestness, and N\. Dey \(2023\-06\) SlimPajama: A 627B token cleaned and deduplicated version of RedPajama\. External Links: [Link](https://huggingface.co/datasets/cerebras/SlimPajama-627B) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p5.1), [§V\-B1](https://arxiv.org/html/2607.01678#S5.SS2.SSS1.p1.6)\.
- \[28\] J\. Song, J\. Yim, J\. Jung, H\. Jang, H\. Kim, Y\. Kim, and J\. Lee \(2023\) Optimus\-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression\. External Links: 2301\.09830, [Link](https://arxiv.org/abs/2301.09830) Cited by: [§VI](https://arxiv.org/html/2607.01678#S6.p3.1)\.
- \[29\] S\. U\. Stich, J\. Cordonnier, and M\. Jaggi \(2018\) Sparsified SGD with Memory\. External Links: 1809\.07599, [Link](https://arxiv.org/abs/1809.07599) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§II\-B](https://arxiv.org/html/2607.01678#S2.SS2.p2.10)\.
- \[30\] N\. J\. Szymanski, B\. Rendy, Y\. Fei, R\. E\. Kumar, T\. He, D\. Milsted, M\. J\. McDermott, M\. Gallant, E\. D\. Cubuk, A\. Merchant, H\. Kim, A\. Jain, C\. J\. Bartel, K\. Persson, Y\. Zeng, and G\. Ceder \(2023\-12\-01\) An autonomous laboratory for the accelerated synthesis of inorganic materials\. Nature 624\(7990\), pp\. 86–91\. External Links: [Document](https://dx.doi.org/10.1038/s41586-023-06734-w), ISSN 1476\-4687, [Link](https://doi.org/10.1038/s41586-023-06734-w) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p1.1)\.
- \[31\] H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, D\. Bikel, L\. Blecher, C\. C\. Ferrer, M\. Chen, G\. Cucurull, D\. Esiobu, J\. Fernandes, J\. Fu, W\. Fu, B\. Fuller, C\. Gao, V\. Goswami, N\. Goyal, A\. Hartshorn, S\. Hosseini, R\. Hou, H\. Inan, M\. Kardas, V\. Kerkez, M\. Khabsa, I\. Kloumann, A\. Korenev, P\. S\. Koura, M\. Lachaux, T\. Lavril, J\. Lee, D\. Liskovich, Y\. Lu, Y\. Mao, X\. Martinet, T\. Mihaylov, P\. Mishra, I\. Molybog, Y\. Nie, A\. Poulton, J\. Reizenstein, R\. Rungta, K\. Saladi, A\. Schelten, R\. Silva, E\. M\. Smith, R\. Subramanian, X\. E\. Tan, B\. Tang, R\. Taylor, A\. Williams, J\. X\. Kuan, P\. Xu, Z\. Yan, I\. Zarov, Y\. Zhang, A\. Fan, M\. Kambadur, S\. Narang, A\. Rodriguez, R\. Stojnic, S\. Edunov, and T\. Scialom \(2023\) Llama 2: Open Foundation and Fine\-Tuned Chat Models\. External Links: 2307\.09288, [Link](https://arxiv.org/abs/2307.09288) Cited by: [§V\-B1](https://arxiv.org/html/2607.01678#S5.SS2.SSS1.p1.6)\.
- \[32\] T\. Vogels, S\. P\. Karimireddy, and M\. Jaggi \(2020\) PowerSGD: Practical Low\-Rank Gradient Compression for Distributed Optimization\. External Links: 1905\.13727, [Link](https://arxiv.org/abs/1905.13727) Cited by: [§VI](https://arxiv.org/html/2607.01678#S6.p3.1)\.
- \[33\] A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2020\) SuperGLUE: a stickier benchmark for general\-purpose language understanding systems\. External Links: 1905\.00537, [Link](https://arxiv.org/abs/1905.00537) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[34\] G\. Wang, H\. Qin, S\. A\. Jacobs, C\. Holmes, S\. Rajbhandari, O\. Ruwase, F\. Yan, L\. Yang, and Y\. He \(2023\) ZeRO\+\+: Extremely Efficient Collective Communication for Giant Model Training\. External Links: 2306\.10209, [Link](https://arxiv.org/abs/2306.10209) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§VI](https://arxiv.org/html/2607.01678#S6.p4.1)\.
- \[35\] H\. Wang, S\. Sievert, Z\. Charles, S\. Liu, S\. Wright, and D\. Papailiopoulos \(2018\) ATOMO: Communication\-efficient Learning via Atomic Sparsification\. External Links: 1806\.04090, [Link](https://arxiv.org/abs/1806.04090) Cited by: [§VI](https://arxiv.org/html/2607.01678#S6.p3.1)\.
- \[36\] Q\. Yi, J\. Duan, H\. Hu, Q\. Hua, H\. Zhao, S\. Qian, D\. Yang, J\. Cao, J\. Tang, Y\. Yu, C\. Liao, K\. Wang, and L\. Zhang \(2025\) EDGC: Entropy\-driven Dynamic Gradient Compression for Efficient LLM Training\. External Links: 2511\.10333, [Link](https://arxiv.org/abs/2511.10333) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§III\-B](https://arxiv.org/html/2607.01678#S3.SS2.p1.1)\.
- \[37\] Q\. Yi, J\. Duan, H\. Hu, Q\. Hua, H\. Zhao, S\. Qian, D\. Yang, J\. Cao, J\. Tang, Y\. Yu, C\. Liao, K\. Wang, and L\. Zhang \(2025\) EDGC: Entropy\-driven Dynamic Gradient Compression for Efficient LLM Training\. External Links: 2511\.10333, [Link](https://arxiv.org/abs/2511.10333) Cited by: [§VI](https://arxiv.org/html/2607.01678#S6.p2.5)\.
- \[38\] R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: Can a Machine Really Finish Your Sentence?\. External Links: 1905\.07830, [Link](https://arxiv.org/abs/1905.07830) Cited by: [§V\-A](https://arxiv.org/html/2607.01678#S5.SS1.p2.1)\.
- \[39\] H\. Zhang, B\. Wang, and L\. Chen \(2025\-11\) AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post\-training\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 10719–10738\. External Links: [Link](https://doi.org/10.18653/v1/2025.emnlp-main.543), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.543), ISBN 979\-8\-89176\-332\-6 Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p4.1), [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p1.3), [§II](https://arxiv.org/html/2607.01678#S2.p1.1)\.
- \[40\] Y\. Zhao, A\. Gu, R\. Varma, L\. Luo, C\. Huang, M\. Xu, L\. Wright, H\. Shojanazeri, M\. Ott, S\. Shleifer, A\. Desmaison, C\. Balioglu, P\. Damania, B\. Nguyen, G\. Chauhan, Y\. Hao, A\. Mathews, and S\. Li \(2023\) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel\. External Links: 2304\.11277, [Link](https://arxiv.org/abs/2304.11277) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p1.1), [§II\-C](https://arxiv.org/html/2607.01678#S2.SS3.p2.4), [§VI](https://arxiv.org/html/2607.01678#S6.p4.1)\.
- \[41\] M\. Zheng and Z\. Zhang \(2025\) Radius: Range\-based Gradient Sparsity for Large Foundation Model Pre\-training\. In Proceedings of Machine Learning and Systems, M\. Zaharia, G\. Joshi, and Y\. Lin \(Eds\.\), Vol\. 7\. External Links: [Link](https://proceedings.mlsys.org/paper_files/paper/2025/file/54dd9e0cff6d9214e20d97eb2a3bae49-Paper-Conference.pdf) Cited by: [§I](https://arxiv.org/html/2607.01678#S1.p2.1), [§I](https://arxiv.org/html/2607.01678#S1.p3.2), [§III\-B](https://arxiv.org/html/2607.01678#S3.SS2.p1.1), [§VI](https://arxiv.org/html/2607.01678#S6.p2.5)\.
