Tag
The paper introduces the context-ready transformer, a recurrent architecture that pre-contextualizes tokens before the transformer block, achieving significant inference speedups (e.g., 1.7x on A100) while matching or exceeding standard transformer performance with fewer layers.
This paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model for high-resolution text-to-image synthesis, introducing a token-editing mechanism and a grouped cross-entropy objective to improve token refinement and training efficiency.
This paper introduces LISA, a regularization method that aligns the intermediate features of a side network with an approximated likelihood score to improve training efficiency and the quality of visual-condition controllable generation in score-based generative models.
Introduces Holistic Data Scheduler (HDS), a reinforcement-learning-based framework that dynamically adjusts data mixtures during LLM pre-training using a multi-objective reward function, reaching target perplexity in 44% fewer iterations and achieving a 7.2% improvement on MMLU.
Introduces Bayesian Manifold Curriculum (BMC), an adaptive curriculum learning method for LLMs that leverages the model's latent geometry to allocate training effort across diverse problem types, improving efficiency beyond traditional difficulty-based curricula.
Introduces PACI, a bubble-free asynchronous pipeline parallel training method that bounds forward/backward weight inconsistency using local gradient accumulation, achieving higher throughput and faster time-to-accuracy without sacrificing stability or increasing memory usage.
This paper presents the first systematic study of singular value spectral behavior in Muon optimizer momentum matrices during LLM training, discovering clean power-law scaling relationships across model sizes (77M–2.8B parameters). The findings provide practitioners with principled, layer-aware guidelines for configuring Newton–Schulz iterations to maintain orthonormalization quality at frontier scale without unnecessary computation.
This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.
This paper systematically analyzes the sensitivity of tool-calling evaluations to minor implementation choices such as random seeds and multi-turn templates, revealing that these can cause substantial performance variation. It also identifies sources of computational waste in RL-based tool-calling training and introduces techniques to accelerate training without sacrificing performance.
The paper proposes FBOS-RL, a feedback-driven bi-objective synergistic reinforcement learning framework that improves training efficiency and performance ceiling over GRPO in LLM alignment and reasoning by using feedback-guided exploration and two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation.
This paper proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive residual replacement that improves cross-layer information flow in Diffusion Transformers, leading to significant training acceleration and quality improvements.
Lens is a compact 3.8B-parameter text-to-image model from Microsoft that achieves competitive performance with larger models while requiring significantly less training compute, using dense captions, multi-resolution batching, and an efficient architecture.
Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.
NVIDIA trained a 12-billion-parameter LLM in 4-bit precision using the new NVFP4 format with micro-scaling, achieving near-zero loss in model quality while halving memory usage and tripling arithmetic speed.
Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification, achieving state-of-the-art alignment quality with substantial training acceleration.
This paper investigates the impact of subword tokenization on LLM training efficiency and performance by conducting controlled byte-level pretraining experiments. It identifies key factors behind that impact, such as training throughput and the integration of subword boundaries as linguistic priors.
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
This paper introduces SimReg, a regularization technique for LLM pretraining that uses embedding similarity to speed up training convergence by over 30% and boost zero-shot performance.
This paper introduces NoiseRater, a meta-learning framework that assigns importance scores to individual noise samples during diffusion model training to improve efficiency and generation quality.
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
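
As a reference point for the Muon entries above, which mention configuring Newton–Schulz iterations to maintain orthonormalization quality, the following is a minimal sketch of how the iteration count affects that quality. It uses the classical cubic Newton–Schulz update, not necessarily the tuned polynomial or the settings any of the listed papers use. The function names, the example matrix, and the step counts are illustrative assumptions, and the code is kept dependency-free rather than fast.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthonormalize(m, steps=5):
    """Approximate the semi-orthogonal polar factor of m.

    Classical cubic iteration X <- 1.5*X - 0.5*(X X^T) X (an assumption of
    this sketch; Muon implementations typically use a tuned variant).
    """
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in m for v in row))
    if fro == 0.0:
        return [row[:] for row in m]
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


def orthonormality_error(x):
    """Largest absolute entry of X X^T - I (0 means orthonormal rows)."""
    g = matmul(x, transpose(x))
    return max(
        abs(g[i][j] - (1.0 if i == j else 0.0))
        for i in range(len(g))
        for j in range(len(g))
    )


if __name__ == "__main__":
    momentum = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]  # hypothetical 2x3 matrix
    for steps in (1, 3, 5, 10, 30):
        q = newton_schulz_orthonormalize(momentum, steps=steps)
        print(steps, orthonormality_error(q))
```

Running the script prints the orthonormality error for increasing step counts, which should shrink toward zero for this well-conditioned example. Matrices with very small singular values need more steps, which is the kind of trade-off layer-aware guidelines would address.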