One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Hugging Face Daily Papers 06/29/26, 12:00 AM Papers

asynchronous-pipeline gradient-staleness llm-pretraining optimizer error-feedback large-scale

Summary

This paper challenges the assumption that one-step gradient delay in asynchronous pipeline parallelism is inherently unstable, showing that degradation depends on optimizer choice. It demonstrates that optimizers like Muon are robust to one-step delay and introduces an error-feedback correction to further mitigate staleness, achieving near-synchronous performance in LLM pretraining up to 10B parameters.

Modern large-scale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it ensures a constant one-step gradient delay regardless of pipeline depth. However, its adoption remains limited due to the common belief that optimizing under staleness is fundamentally unstable. In this work, we challenge this assumption, demonstrating that degradation under one-step delay depends strongly on optimizer choice rather than being an intrinsic limitation. We provide the first comprehensive empirical analysis showing that while AdamW, the predominant optimizer at the time when PipeDream-2BW was introduced, indeed suffers from severe degradation, recent methods like Muon exhibit strong robustness under a one-step delay. We introduce an optimizer-agnostic Error Feedback-inspired correction to further mitigate delay effects. We provide supporting theoretical analysis demonstrating convergence for Muon with and without this correction. Extensive evaluation on models up to 10B parameters confirms that our strategies bridge the performance gap with synchronous training, highlighting the practical potential of asynchronous pipeline parallelism at scale.
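To make "one-step gradient delay" concrete, here is a minimal sketch under stated assumptions. It is not the paper's method. It runs plain gradient descent on a toy quadratic, where the update at step t uses the gradient evaluated at the weights from one step earlier. The function name `delayed_gradient_descent`, the matrix `A`, and the learning rates are all hypothetical choices made for illustration. The sketch does not model AdamW, Muon, or the Error Feedback-inspired correction. It only shows that staleness can shrink the range of stable learning rates for a simple update rule.

```python
import numpy as np

def delayed_gradient_descent(grad_fn, w0, lr, steps, delay=1):
    """Gradient descent where step t applies the gradient computed at the
    weights from `delay` steps earlier (delay=0 is the synchronous case)."""
    history = [np.array(w0, dtype=float)]  # history[t] holds the weights w_t
    for t in range(steps):
        stale_w = history[max(0, t - delay)]   # weights the gradient was computed on
        g = grad_fn(stale_w)                   # stale gradient
        history.append(history[-1] - lr * g)   # applied to the current weights
    return history

# Toy objective f(w) = 0.5 * w^T A w with curvatures 1 and 10 (illustrative only).
A = np.diag([1.0, 10.0])

def grad(w):
    return A @ w

w_init = [1.0, 1.0]

# Same learning rate, with and without a one-step delay.
sync_large = delayed_gradient_descent(grad, w_init, lr=0.15, steps=200, delay=0)
stale_large = delayed_gradient_descent(grad, w_init, lr=0.15, steps=200, delay=1)

# A smaller learning rate keeps the delayed run stable on this problem.
stale_small = delayed_gradient_descent(grad, w_init, lr=0.05, steps=200, delay=1)

print("sync,  lr=0.15:", np.linalg.norm(sync_large[-1]))
print("stale, lr=0.15:", np.linalg.norm(stale_large[-1]))
print("stale, lr=0.05:", np.linalg.norm(stale_small[-1]))
```

On this toy problem the synchronous run converges at lr=0.15 while the delayed run diverges. The delayed run converges once the learning rate is reduced. For a curvature a, the delayed recurrence w_{t+1} = w_t - lr * a * w_{t-1} is stable only when lr * a < 1, whereas the synchronous update tolerates lr * a < 2. How adaptive or orthogonalized optimizers behave under the same delay is the question the paper studies, and this sketch does not answer it.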

Original Article

Cached at: 06/30/26, 03:37 PM

Source: https://huggingface.co/papers/2606.30634

Abstract

Asynchronous pipeline parallelism with PipeDream-2BW can achieve near-synchronous performance through optimizer selection and error feedback correction, overcoming traditional stability concerns.

Modern large-scale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it ensures a constant one-step gradient delay regardless of pipeline depth. However, its adoption remains limited due to the common belief that optimizing under staleness is fundamentally unstable. In this work, we challenge this assumption, demonstrating that degradation under one-step delay depends strongly on optimizer choice rather than being an intrinsic limitation. We provide the first comprehensive empirical analysis showing that while AdamW, the predominant optimizer at the time when PipeDream-2BW was introduced, indeed suffers from severe degradation, recent methods like Muon exhibit strong robustness under a one-step delay. We introduce an optimizer-agnostic Error Feedback-inspired correction to further mitigate delay effects. We provide supporting theoretical analysis demonstrating convergence for Muon with and without this correction. Extensive evaluation on models up to 10B parameters confirms that our strategies bridge the performance gap with synchronous training, highlighting the practical potential of asynchronous pipeline parallelism at scale.

Similar Articles

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

Can Muon Fine-tune Adam-Pretrained Models?
