ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]
Summary
ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks for pipeline-parallel training, achieving 128× activation compression while maintaining convergence. The work advances decentralized distributed training over internet-grade links by reducing inter-stage communication overhead.
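To give a sense of what a 128× reduction means for inter-stage traffic, here is a rough back-of-the-envelope sketch. Only the 128× ratio comes from the summary. The batch size, sequence length, hidden width, 16-bit activations, 100 Mbit/s link, and the helper names `activation_bytes` and `transfer_seconds` are illustrative assumptions. The sketch also models compression as a narrower hidden dimension at the stage boundary, which may not match how ResBM actually implements its bottleneck.

```python
# Back-of-the-envelope estimate of pipeline-stage traffic.
# All shapes and the link speed are hypothetical; only the 128x ratio
# is taken from the summary above.

COMPRESSION = 128  # activation compression ratio reported for ResBM


def activation_bytes(batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes in one activation tensor passed between pipeline stages."""
    return batch * seq_len * hidden * bytes_per_elem


def transfer_seconds(num_bytes, link_bits_per_sec):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes * 8 / link_bits_per_sec


# Assumed micro-batch shape and 16-bit activations.
batch, seq_len, hidden = 8, 2048, 4096
link = 100e6  # assumed 100 Mbit/s link between stages

# Assumption: compression is modeled as a narrower boundary width.
full = activation_bytes(batch, seq_len, hidden)
compressed = activation_bytes(batch, seq_len, hidden // COMPRESSION)

print(f"uncompressed: {full / 2**20:.0f} MiB, {transfer_seconds(full, link):.2f} s")
print(f"compressed:   {compressed / 2**20:.0f} MiB, {transfer_seconds(compressed, link):.3f} s")
```

Under these assumed numbers, each forward hand-off shrinks from 128 MiB to 1 MiB, which is the difference between roughly ten seconds and under a tenth of a second on the assumed link.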
Similar Articles
WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers
This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.
Block-Based Double Decoders
Proposes block-based double decoders, a novel transformer architecture using doubly-causal block-based attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, achieving strong scaling performance and reduced KV-cache memory.
Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping
This paper proposes H-Res, a method to adapt large transformer models by shaping the energy landscape of associative memories without modifying weights or adding prompts, preserving memory capacity and outperforming LoRA.
BA-T: An Iterative Transformer for Two-View Bundle Adjustment
BA-T is an iterative Transformer architecture for two-view bundle adjustment that improves 3D reconstruction accuracy and cross-view consistency using a lightweight design with only 16% of conventional decoder parameters, matching or surpassing larger models.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
This paper introduces the Structured Recurrent Mixer (SRM), an architecture enabling algebraic conversion between parallel training and recurrent inference without specialized kernels. Experiments show SRMs achieve significantly higher throughput and concurrency compared to Transformers, with effective performance in reinforcement learning tasks.