Tag
SCAPE is a communication-efficient distributed optimizer that leverages first-moment statistics to enable extreme sparsification for LLM training, preserving accuracy while reducing wall-clock time by up to 43.3%.
GLM 5.2 post-training code is open-sourced, using Megatron-LM for training and SGLang for rollout generation, forming a continuous RL loop with synchronized weights.
This paper introduces DisagMoE, a system for MoE training that optimizes computation-communication overlap by disaggregating attention and FFN layers across GPU groups. Implemented on Megatron-LM, it achieves up to 1.8x speedup on H800 clusters by addressing inter-node communication bottlenecks.