Tag
This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines.
A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.
This paper analyzes token-level gradient dynamics in RLVR training, revealing how advantage sign and token probability jointly affect update stability, and introduces Winner Advantage Policy Optimization (WAPO), which performs clipped updates only on positive-advantage completions to improve stability.
AAD-1 introduces asymmetric adversarial distillation with phased training to achieve one-step autoregressive video generation, outperforming prior methods on VBench.
Fern announces a new regularization technique that solves the SolidGoldMagikarp stability problem, with details to follow in the thread.
DVAO adaptively weights objectives based on reward variance to improve multi-reward RL training stability and multi-objective performance.
This paper identifies gradient oscillation and residual explosion as causes of training instability in Looped Transformers, and proposes the Fully Looped Transformer, which adds two parameter-free modifications (Fully Looped Architecture and Attention Injection) to stabilize training up to 12 loop iterations, achieving up to a 13.2% improvement in downstream performance.
This paper introduces LBW-Guard, a governance layer for bounded autonomous training control that operates above the AdamW optimizer, monitoring telemetry and applying bounded control during training, and demonstrates improved perplexity and training speed under stress conditions.
Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.
The author announces the release of their first AI research paper, STAM (Stable Training with Adaptive Momentum), a new deep learning optimizer addressing stability and resource efficiency, and invites feedback from the AI community.
This paper proposes a covariance-aware variant of Group Relative Policy Optimization (GRPO) that uses Gaussian-kernel advantage reweighting to stabilize training entropy and improve reasoning performance in large language models.
DeepSeek released the full V4 paper detailing FP4 quantization-aware training, MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF. The reported efficiency gains are large: V4-Flash uses only 10% of V3.2's FLOPs and 7% of its KV cache at 1M context length.
Token AI releases a research paper introducing STAM, a new adaptive momentum optimizer designed to improve training stability and reduce memory usage compared to standard optimizers like AdamW.
This paper introduces UniSD, a unified self-distillation framework for adapting large language models that integrates mechanisms for supervision reliability, representation alignment, and training stability. Experimental results show that UniSD improves performance over base models and existing baselines across multiple benchmarks.
This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.
This paper identifies and addresses aggregation bias in GRPO-style reinforcement learning for LLMs, proposing Balanced Aggregation (BA), which improves training stability and final performance by computing token-level means separately for positive and negative subsets.
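
The last entry describes computing token-level means separately for positive and negative subsets. A minimal sketch of that idea follows; it is not the paper's implementation. The function name `balanced_aggregate`, the per-sequence advantage input, the padding mask, and the choice to sum the two subset means are all assumptions made for illustration.

```python
import numpy as np

def balanced_aggregate(token_terms, advantages, mask):
    """Aggregate per-token terms with separate means for the two advantage signs.

    token_terms: (B, T) per-token objective terms (hypothetical input).
    advantages:  (B,)   one advantage per sequence (assumption).
    mask:        (B, T) 1 for real tokens, 0 for padding.
    """
    token_terms = np.asarray(token_terms, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mask = np.asarray(mask, dtype=float)

    # Token-level masks for sequences with positive / negative advantage.
    pos_mask = (advantages > 0)[:, None] * mask
    neg_mask = (advantages < 0)[:, None] * mask

    def masked_mean(m):
        count = m.sum()
        # An empty subset contributes nothing instead of dividing by zero.
        return float((token_terms * m).sum() / count) if count > 0 else 0.0

    pos_mean = masked_mean(pos_mask)
    neg_mean = masked_mean(neg_mask)
    # Assumption: the two subset means are combined by a plain sum.
    return pos_mean + neg_mean, pos_mean, neg_mean


terms = [[1.0, 3.0, 0.0], [-1.0, -2.0, -3.0]]
advs = [1.0, -1.0]
mask = [[1, 1, 0], [1, 1, 1]]
total, pos_mean, neg_mean = balanced_aggregate(terms, advs, mask)
print(total, pos_mean, neg_mean)
```

Because each subset is averaged over its own token count, a subset with many long completions does not outweigh the other simply by contributing more tokens, which is one plausible reading of the aggregation bias the entry mentions.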