Tag
A discussion of rewriting the parallelism of the FA4 (FlashAttention 4) kernel to improve its performance, using CuTe DSL and tile programming models.
Explains that inference kernels differ from training kernels, with FlashAttention 4 focusing on changing parallelism across KV and supporting small, irregular loads.
This NVIDIA Research blog post discusses how sequence parallelism can scale long-video training systems for both understanding and generation, addressing the challenge of fitting very long video sequences across multiple GPUs.
The article shares practical lessons for building low-latency, high-throughput AI agents, including workload estimation, token reduction, parallelism, microservices, and handling LLM failures.
Kakuna is a skill that hardens codebases by automating boring tasks, making production-ready commits with audit trails, and encoding opinions on designing apps for both human and agent access, focusing on subagent parallelism and the 'mullet factory' approach.
A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.
DynaTrain is a distributed training system enabling sub-second online reconfiguration of parallelism for large language models, using a Virtual Parameter Space abstraction to achieve up to three orders of magnitude faster transitions than existing methods.
This paper introduces SNLP, a framework that enables layer-parallel inference for transformers by replacing exact Newton corrections with structured approximations, achieving up to 2.3x speedup on a 0.5B model while improving perplexity.
A technical deep-dive into common causes of failed pretraining runs in large language models, including causality-breaking issues in expert routing and numerical precision bugs, with examples from Llama 4, Gemini 2 Pro, and GPT-4.
OxCaml, Jane Street's fork of the OCaml compiler, introduces compile-time guarantees against data races, enabling sequential consistency without runtime overhead. The blog post explains the new mode axes and their implications for parallel programming.
This paper introduces DisagMoE, a system for MoE training that optimizes computation-communication overlap by disaggregating attention and FFN layers across GPU groups. Implemented on Megatron-LM, it achieves up to 1.8x speedup on H800 clusters by addressing inter-node communication bottlenecks.
OpenAI presents comprehensive techniques for training large neural networks across distributed GPU clusters, covering data parallelism, pipeline parallelism, tensor parallelism, and mixture-of-experts approaches to overcome engineering and scalability challenges.