pipeline-parallelism

Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

arXiv cs.LG · 3d ago

This paper introduces MAPL, a method for learned orthogonal compression of activations in pipeline parallelism, reducing communication overhead while maintaining performance via Stiefel manifold constraints and per-stage factorized anchor embeddings.

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

arXiv cs.CL · 2026-06-01

This paper proposes Speculative Pipeline Decoding (SPD), a framework that uses pipeline parallelism within a single LLM to enable parallel token speculation, avoiding the latency bubbles and accuracy degradation of multi-token prediction in traditional speculative decoding.

SpaceX has almost finished writing V1.0 of an in-house AI training stack in C (2 minute read)

TLDR AI · 2026-05-29

SpaceX is finalizing a custom AI training stack written in C, utilizing pipeline parallelism and 220k GB300 GPUs to achieve over an order of magnitude speed improvement, with plans to develop an inference stack for reinforcement learning.

@levidiamode: Day 138/365 of GPU Programming One of my favorite lectures I've watched this year is Stanford's CS336 lecture 7 on GPU …

X AI KOLs Timeline · 2026-05-21

A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

Reddit r/LocalLLaMA · 2026-05-17

This post benchmarks vLLM, SGLang, and llama.cpp on a mixed Blackwell/Ada GPU cluster for long-context prefill. It finds that vLLM significantly outperforms the others on heterogeneous setups, while SGLang crashes on Ada cards due to FP4 support limitations.

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Reddit r/MachineLearning · 2026-04-16

ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks for pipeline-parallel training, achieving 128× activation compression while maintaining convergence. The work advances decentralized, internet-grade distributed training by reducing inter-stage communication overhead.
