ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Reddit r/MachineLearning Papers

Summary

ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks at pipeline boundaries for pipeline-parallel training. The paper reports 128× activation compression without significant loss in convergence relative to uncompressed baselines. The work targets decentralized, internet-grade distributed training by reducing inter-stage communication overhead.

Macrocosmos has released a paper on ResBM (Residual Bottleneck Models), a new transformer-based architecture designed for low-bandwidth pipeline-parallel training: [https://arxiv.org/abs/2604.11947](https://arxiv.org/abs/2604.11947). ResBM introduces a residual encoder-decoder bottleneck across pipeline boundaries, with the goal of reducing inter-stage communication while preserving an explicit low-rank identity path. The paper reports state-of-the-art 128× activation compression without significant loss in convergence relative to uncompressed baselines. In their experiments, the strongest compressed results use the Muon optimizer, and the paper positions ResBM as a development in decentralized / internet-grade pipeline-parallel training.
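To make the idea of a bottleneck at a pipeline boundary concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the post gives no layer definitions, so the function names, the hidden size of 1024, and the choice of an orthonormal projection for the "low-rank identity path" are all assumptions made for illustration. The learned residual components of the encoder and decoder are omitted entirely. The sketch only shows how a 128× narrower tensor could cross the stage boundary and be expanded again on the other side.

```python
import numpy as np


def make_bottleneck(d_model, ratio, seed=0):
    """Build hypothetical encoder/decoder matrices for one stage boundary.

    Assumption: the encoder has orthonormal columns and the decoder is its
    transpose, so decode(encode(h)) is a rank-(d_model // ratio) projection
    of h. This is one way to read "low-rank identity path"; the paper may
    define it differently.
    """
    rng = np.random.default_rng(seed)
    d_low = d_model // ratio
    # Reduced QR of a tall random matrix yields orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((d_model, d_low)))
    encoder = q        # shape (d_model, d_low)
    decoder = q.T      # shape (d_low, d_model)
    return encoder, decoder


def send_across_boundary(h, encoder):
    """Sending stage: compress activations before transmission."""
    return h @ encoder


def receive_from_boundary(z, decoder):
    """Receiving stage: expand the compressed activations again."""
    return z @ decoder


d_model, ratio = 1024, 128          # illustrative sizes, not from the paper
encoder, decoder = make_bottleneck(d_model, ratio)

# Fake activations with shape (batch, sequence, d_model).
h = np.random.default_rng(1).standard_normal((4, 16, d_model))

z = send_across_boundary(h, encoder)        # what crosses the network
h_hat = receive_from_boundary(z, decoder)   # what the next stage sees

# Element-count ratio between uncompressed and transmitted activations.
compression = h.size / z.size
print(z.shape, h_hat.shape, compression)
```

With these sizes, each token's 1024-dimensional activation is sent as 8 values, which is where the 128× figure comes from in this toy setting. The projection discards everything outside the 8-dimensional subspace, so any real system would depend on training (and, per the post, on the optimizer choice) to make that subspace carry the information the next stage needs.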

Similar Articles

Block-Based Double Decoders

arXiv cs.LG

Proposes block-based double decoders, a novel transformer architecture using doubly-causal block-based attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, achieving strong scaling performance and reduced KV-cache memory.

BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Hugging Face Daily Papers

BA-T is an iterative Transformer architecture for two-view bundle adjustment that improves 3D reconstruction accuracy and cross-view consistency using a lightweight design with only 16% of conventional decoder parameters, matching or surpassing larger models.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

arXiv cs.CL

This paper introduces the Structured Recurrent Mixer (SRM), an architecture enabling algebraic conversion between parallel training and recurrent inference without specialized kernels. Experiments show SRMs achieve significantly higher throughput and concurrency compared to Transformers, with effective performance in reinforcement learning tasks.