ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Reddit r/MachineLearning 04/16/26, 03:08 PM Papers

transformer-architecture pipeline-parallelism activation-compression low-bandwidth-training residual-bottleneck distributed-training

Summary

ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks at pipeline boundaries for pipeline-parallel training. The paper reports 128× activation compression without significant loss in convergence relative to uncompressed baselines. The work targets decentralized, internet-grade distributed training by reducing inter-stage communication overhead.

Macrocosmos has released a paper on ResBM (Residual Bottleneck Models), a new transformer-based architecture designed for low-bandwidth pipeline-parallel training: [https://arxiv.org/abs/2604.11947](https://arxiv.org/abs/2604.11947)

ResBM introduces a residual encoder-decoder bottleneck across pipeline boundaries, with the goal of reducing inter-stage communication while preserving an explicit low-rank identity path. The paper reports state-of-the-art 128× activation compression without significant loss in convergence relative to uncompressed baselines. In their experiments, the strongest compressed results use Muon, and the paper positions ResBM as a development in decentralized / internet-grade pipeline-parallel training.
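To make the 128× figure concrete, here is a rough sketch of what a bottleneck at a pipeline boundary does to per-token traffic. It is not the paper's method: the post does not describe the learned encoder or decoder, so `encode` and `decode` below are hypothetical stand-ins (a fixed block-averaging projection and its transpose), and the model width, micro-batch size, and 2-byte values are assumptions chosen only for illustration. The stand-in does have one relevant property: decoding and re-encoding returns the same compressed vector, so the round trip acts as the identity on a low-dimensional subspace.

```python
import math

COMPRESSION = 128  # ratio reported in the post


def encode(x, ratio=COMPRESSION):
    """Project a width-d activation vector down to d // ratio numbers.

    Hypothetical stand-in for a learned encoder: each output is the scaled
    sum of one contiguous block of `ratio` inputs (orthonormal block map).
    """
    if len(x) % ratio != 0:
        raise ValueError("width must be divisible by the compression ratio")
    scale = 1.0 / math.sqrt(ratio)
    return [scale * sum(x[i:i + ratio]) for i in range(0, len(x), ratio)]


def decode(z, ratio=COMPRESSION):
    """Expand a compressed vector back to full width (transpose of encode)."""
    scale = 1.0 / math.sqrt(ratio)
    return [scale * v for v in z for _ in range(ratio)]


def boundary_bytes(tokens, width, bytes_per_value=2):
    """Bytes sent across one pipeline boundary for `tokens` activation vectors."""
    return tokens * width * bytes_per_value


# Assumed sizes, for illustration only.
d_model = 4096
tokens = 8 * 2048  # micro-batch of 8 sequences, 2048 tokens each
x = [float(i % 7) for i in range(d_model)]

z = encode(x)      # what the sending stage would transmit
x_hat = decode(z)  # what the receiving stage would reconstruct

full = boundary_bytes(tokens, d_model)
compressed = boundary_bytes(tokens, len(z))
print(len(x), "->", len(z), "values per token")
print(full // compressed, "x fewer bytes per boundary")
```

Under these assumptions each token crosses the boundary as 32 values instead of 4096, and the byte count per micro-batch drops by the same factor in each direction. How much of that translates into wall-clock savings depends on link bandwidth, latency, and pipeline scheduling, none of which the post quantifies.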

Similar Articles

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

arXiv cs.LG

This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.

Block-Based Double Decoders

arXiv cs.LG

Proposes block-based double decoders, a novel transformer architecture using doubly-causal block-based attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, achieving strong scaling performance and reduced KV-cache memory.

Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping

arXiv cs.LG

This paper proposes H-Res, a method to adapt large transformer models by shaping the energy landscape of associative memories without modifying weights or adding prompts, preserving memory capacity and outperforming LoRA.

BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Hugging Face Daily Papers

BA-T is an iterative Transformer architecture for two-view bundle adjustment that improves 3D reconstruction accuracy and cross-view consistency using a lightweight design with only 16% of conventional decoder parameters, matching or surpassing larger models.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

arXiv cs.CL

This paper introduces the Structured Recurrent Mixer (SRM), an architecture enabling algebraic conversion between parallel training and recurrent inference without specialized kernels. Experiments show SRMs achieve significantly higher throughput and concurrency compared to Transformers, with effective performance in reinforcement learning tasks.
