DeepMind introduces Decoupled DiLoCo, a new distributed AI training architecture that enables resilient, low-bandwidth training of large models across globally dispersed data centers by isolating hardware failures.
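A rough sketch of the recipe this family of methods uses, assuming an initialized torch.distributed process group: each worker takes many communication-free inner AdamW steps, then the averaged parameter delta drives one infrequent outer update. The function name, hyperparameters, and the plain interpolation-style outer step are illustrative assumptions (the DiLoCo papers use an outer Nesterov-momentum optimizer), not DeepMind's code.

```python
import torch
import torch.distributed as dist

def diloco_round(model, global_params, data_iter, inner_steps=500, outer_lr=0.7):
    """One outer round: many local steps, one low-bandwidth synchronization.
    `global_params` is a detached snapshot of the parameters at the round start."""
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(inner_steps):                       # communication-free inner steps
        x, y = next(data_iter)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Pseudo-gradient: how far this worker drifted from the last global snapshot.
    deltas = [g - p.detach() for g, p in zip(global_params, model.parameters())]
    for d in deltas:                                   # average the drift across workers
        dist.all_reduce(d, op=dist.ReduceOp.AVG)       # AVG requires the NCCL backend

    # Outer update: move the snapshot along the averaged delta and reload it locally
    # (simple SGD-style step here; DiLoCo uses Nesterov momentum for this update).
    with torch.no_grad():
        for g, p, d in zip(global_params, model.parameters(), deltas):
            g -= outer_lr * d
            p.copy_(g)
```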
ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks for pipeline-parallel training, achieving 128× activation compression while maintaining convergence. By cutting inter-stage communication overhead, the work advances decentralized training over internet-grade links.
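ResBM's actual residual design isn't reproduced here; the sketch below only illustrates the general mechanism of a learned bottleneck at a pipeline-stage boundary, with an assumed 4096 → 32 projection standing in for the quoted 128× compression. Layer names and sizes are placeholders.

```python
import torch.nn as nn

class BottleneckSend(nn.Module):
    """Sending stage: project hidden states into a narrow code before the
    inter-stage transfer, so only the small tensor crosses the network."""
    def __init__(self, hidden=4096, code=32):          # 4096 / 32 = 128x smaller
        super().__init__()
        self.down = nn.Linear(hidden, code)

    def forward(self, h):
        return self.down(h)

class BottleneckRecv(nn.Module):
    """Receiving stage: reconstruct hidden states from the code; trained
    end-to-end with the rest of the model so convergence is preserved."""
    def __init__(self, hidden=4096, code=32):
        super().__init__()
        self.up = nn.Linear(code, hidden)

    def forward(self, code):
        return self.up(code)
```

The paper's bottlenecks are residual encoder-decoders; this pair of linear layers is only the bare-minimum stand-in for the compress-transfer-reconstruct pattern.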
Hugging Face publishes a comprehensive analysis of 16 open-source reinforcement learning libraries, examining architectural patterns for asynchronous RL training and presenting design lessons for TRL's async trainer to address generation bottlenecks and weight synchronization challenges.
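The design lessons center on decoupling generation from optimization; the toy sketch below (stand-in stubs, not TRL's async trainer API) shows the shape of that loop: a generator thread fills a bounded rollout queue while the trainer consumes it and pushes refreshed weights back on a fixed cadence, so neither side blocks on the other.

```python
import queue
import threading
import time

class Generator:                 # stand-in for an inference engine such as vLLM
    def generate(self, prompts):
        time.sleep(0.01)         # simulate slow autoregressive decoding
        return [p + " <completion>" for p in prompts]
    def load_weights(self, version):
        print(f"generator now serving weights v{version}")

class Trainer:                   # stand-in for the RL trainer
    version = 0
    def step(self, batch):
        time.sleep(0.005)        # simulate a gradient step on the rollouts
    def sync(self, generator):
        self.version += 1
        generator.load_weights(self.version)

rollouts = queue.Queue(maxsize=64)            # bounded buffer decouples the two loops

def generation_loop(gen, prompts):
    while True:
        rollouts.put(gen.generate(prompts))   # blocks only when the buffer is full

def training_loop(trainer, gen, sync_every=8, max_steps=32):
    for step in range(1, max_steps + 1):
        trainer.step(rollouts.get())          # rollouts may be slightly off-policy
        if step % sync_every == 0:            # weight-synchronization cadence
            trainer.sync(gen)

gen, trainer = Generator(), Trainer()
threading.Thread(target=generation_loop, args=(gen, ["hi"]), daemon=True).start()
training_loop(trainer, gen)
```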
Ulysses Sequence Parallelism is a technique for training LLMs with million-token contexts by distributing sequence chunks across GPUs, reducing memory requirements and enabling efficient long-context training. It integrates with HuggingFace Accelerate, Transformers Trainer, and TRL, with support for Flash Attention and DeepSpeed ZeRO.
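At the core of Ulysses is an all-to-all that switches activations from a sequence-sharded layout to a head-sharded one around attention (and back afterwards). The function below is a simplified, assumed reimplementation of that exchange, not DeepSpeed's code; it maps a local chunk of shape [S/P, H, D] to [S, H/P, D] across P ranks.

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x, group=None):
    """All-to-all that turns a sequence-sharded tensor [S/P, H, D] into a
    head-sharded one [S, H/P, D], so each rank can run full-sequence attention
    for its subset of heads."""
    P = dist.get_world_size(group)
    s, H, D = x.shape                                  # s = S / P local sequence chunk
    # Arrange so that the chunk destined for rank j is the j-th head group.
    x = x.reshape(s, P, H // P, D).permute(1, 0, 2, 3).contiguous()   # [P, s, H/P, D]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)        # exchange chunks across ranks
    # out[i] holds rank i's sequence chunk for our head group; stitch the chunks
    # back together along the sequence dimension.
    return out.reshape(P * s, H // P, D)
```

Attention then runs locally over the full sequence for H/P heads, and the inverse all-to-all restores the sequence-sharded layout for the rest of the layer.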
OpenAI presents comprehensive techniques for training large neural networks across distributed GPU clusters, covering data parallelism, pipeline parallelism, tensor parallelism, and mixture-of-experts approaches to overcome engineering and scalability challenges.
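As a small concrete anchor for one of the techniques the post surveys, here is a forward-only sketch of a column-parallel linear layer; the class name and shapes are assumptions, not OpenAI's code. Each rank holds a slice of the weight's output columns, computes its slice of the activation, and an all-gather reassembles the full output.

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank owns a column slice of W; an all-gather reassembles the output.
    Forward-only illustration: real tensor-parallel layers also handle the
    backward collective and often defer or fuse the gather."""
    def __init__(self, in_features, out_features, group=None):
        super().__init__()
        self.group = group
        self.world = dist.get_world_size(group)
        self.w = torch.nn.Parameter(torch.empty(out_features // self.world, in_features))
        torch.nn.init.normal_(self.w, std=0.02)

    def forward(self, x):
        y_local = x @ self.w.t()                          # this rank's output slice
        parts = [torch.empty_like(y_local) for _ in range(self.world)]
        dist.all_gather(parts, y_local)                   # collect slices from all ranks
        return torch.cat(parts, dim=-1)                   # full output activation
```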
This paper details the design and optimization of PyTorch's distributed data parallel module, highlighting techniques like gradient bucketing and computation-communication overlap that enable near-linear scalability across 256 GPUs.
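The mechanism itself lives in DDP's C++ reducer, but a hedged Python sketch of the idea looks roughly like this: hook each parameter's gradient, fill a bucket as gradients become ready, and launch an asynchronous all-reduce per full bucket so communication overlaps the remainder of the backward pass. The bucket accounting and per-tensor (rather than flattened) reduction below are simplifications.

```python
import torch
import torch.distributed as dist

class TinyDDP:
    """Toy illustration of gradient bucketing and computation-communication overlap."""
    def __init__(self, model, bucket_params=4):
        self.model = model
        self.bucket_params = bucket_params
        self.bucket, self.handles = [], []
        for p in model.parameters():
            if p.requires_grad:
                # Fires once a parameter's gradient is fully accumulated (torch >= 2.1).
                p.register_post_accumulate_grad_hook(self._on_grad_ready)

    def _on_grad_ready(self, param):
        self.bucket.append(param)
        if len(self.bucket) == self.bucket_params:     # bucket is full: reduce it now,
            self._flush()                              # while backward keeps running

    def _flush(self):
        for p in self.bucket:                          # one async all-reduce per tensor;
            h = dist.all_reduce(p.grad, op=dist.ReduceOp.AVG, async_op=True)
            self.handles.append(h)                     # real DDP flattens the bucket first
        self.bucket = []

    def finalize(self):
        self._flush()                                  # reduce any leftover gradients
        for h in self.handles:                         # wait before the optimizer step
            h.wait()
        self.handles = []
```

After loss.backward(), finalize() waits for the outstanding reductions before optimizer.step(); real DDP additionally broadcasts initial parameters from rank 0 and sizes buckets by bytes (25 MB by default) rather than by parameter count.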