parallelism

#parallelism

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

X AI KOLs Following · 4d ago

Discusses rewriting the parallelism of the FA4 (FlashAttention 4) kernel to improve performance, with reference to CuTe DSL and tile programming models.

#parallelism

@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…

X AI KOLs Following · 4d ago

Explains that inference kernels differ from training kernels, with FlashAttention 4 focusing on changing parallelism across KV and supporting small, irregular loads.

#parallelism

@yukangchen_: Excited to share our new blog: Scaling Video Training with Parallelism https://research.nvidia.com/labs/eai/blogs/scali…

X AI KOLs Following · 2026-06-08

This blog from NVIDIA Research discusses how sequence parallelism can scale long-video training systems for both understanding and generation, addressing the challenge of fitting very long video sequences across multiple GPUs.

#parallelism

What I learned building low latency and high throughput AI agents

Reddit r/AI_Agents · 2026-06-05

The article shares practical lessons for building low-latency, high-throughput AI agents, including workload estimation, token reduction, parallelism, microservices, and handling LLM failures.

#parallelism

@swyx: Kakuna: skills with checklists that only know how to harden your codebase /plan with it then let it /goal for a day, it…

X AI KOLs Following · 2026-05-22

Kakuna is a skill that hardens codebases by automating boring tasks, making production-ready commits with audit trails, and encoding opinions on designing apps for both human and agent access, focusing on subagent parallelism and the 'mullet factory' approach.

#parallelism

@levidiamode: Day 138/365 of GPU Programming One of my favorite lectures I've watched this year is Stanford's CS336 lecture 7 on GPU …

X AI KOLs Timeline · 2026-05-21

A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.

#parallelism

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

arXiv cs.LG · 2026-05-20

DynaTrain is a distributed training system enabling sub-second online reconfiguration of parallelism for large language models, using a Virtual Parameter Space abstraction to achieve up to three orders of magnitude faster transitions than existing methods.

#parallelism

SNLP: Layer-Parallel Inference via Structured Newton Corrections

Hugging Face Daily Papers · 2026-05-18

This paper introduces SNLP, a framework that enables layer-parallel inference for transformers by replacing exact Newton corrections with structured approximations, achieving up to 2.3x speedup on a 0.5B model while improving perplexity.

#parallelism

Notes on pretraining parallelisms and failed training runs (12 minute read)

TLDR AI · 2026-05-18

A technical deep-dive into common causes of failed pretraining runs in large language models, including causality-breaking issues in expert routing and numerical precision bugs, with examples from Llama 4, Gemini 2 Pro, and GPT-4.

#parallelism

Data race freedom in OxCaml

Lobsters Hottest · 2026-05-16

OxCaml, Jane Street's fork of the OCaml compiler, introduces compile-time guarantees against data races, enabling sequential consistency without runtime overhead. The blog post explains the new mode axes and their implications for parallel programming.

#parallelism

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

arXiv cs.LG · 2026-05-13

This paper introduces DisagMoE, a system for MoE training that optimizes computation-communication overlap by disaggregating attention and FFN layers across GPU groups. Implemented on Megatron-LM, it achieves up to 1.8x speedup on H800 clusters by addressing inter-node communication bottlenecks.

#parallelism

Techniques for training large neural networks

OpenAI Blog · 2022-06-09

OpenAI presents comprehensive techniques for training large neural networks across distributed GPU clusters, covering data parallelism, pipeline parallelism, tensor parallelism, and mixture-of-experts approaches to overcome engineering and scalability challenges.
