training-stability

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv cs.LG · yesterday

This paper introduces FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function that dynamically schedules gradient weights during RL post-training of LLMs, achieving faster convergence and better accuracy-diversity trade-offs compared to static baselines.

@cwolferesearch: I just published a blog on agentic RL that covers 10+ recent frameworks in the space. Here are the key takeaways… Link …

X AI KOLs Timeline · 2026-06-22

A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Hugging Face Daily Papers · 2026-06-15

This paper analyzes token-level gradient dynamics in RLVR training, showing how advantage sign and token probability jointly affect update stability. It introduces Winner Advantage Policy Optimization (WAPO), which performs clipped updates only on positive-advantage completions to improve stability.

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Hugging Face Daily Papers · 2026-06-02

AAD-1 introduces asymmetric adversarial distillation with phased training to achieve one-step autoregressive video generation, outperforming prior methods on VBench.

@latkins: Fern is one of the best

X AI KOLs Timeline · 2026-05-26

Fern announces a new regularization technique that solves the SolidGoldMagikarp stability problem, with details to follow in the thread.

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Hugging Face Daily Papers · 2026-05-25

DVAO adaptively weights objectives based on reward variance to improve multi-reward RL training stability and multi-objective performance.

Simply Stabilizing the Loop via Fully Looped Transformer

arXiv cs.LG · 2026-05-20

This paper identifies gradient oscillation and residual explosion as causes of training instability in Looped Transformers. It proposes the Fully Looped Transformer, which adds two parameter-free modifications (Fully Looped Architecture and Attention Injection) that stabilize training for up to 12 loop iterations and yield up to a 13.2% improvement in downstream performance.

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

arXiv cs.AI · 2026-05-20

This paper introduces LBW-Guard, a bounded autonomous training control governance layer that operates above the AdamW optimizer to monitor telemetry and apply bounded control during training, demonstrating improved perplexity and training speed under stress conditions.

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

arXiv cs.LG · 2026-05-19

This paper introduces Implicit Behavior Policy Optimization (IBPO), a credit assignment framework based on counterfactual comparison. By converting sparse terminal rewards into step-sensitive learning signals, it improves training stability and performance on multi-step reasoning tasks for large language models.

A very important milestone for me in the AI field.

Reddit r/LocalLLaMA · 2026-05-16

The author announces the release of their first AI research paper, STAM (Stable Training with Adaptive Momentum), a new deep learning optimizer addressing stability and resource efficiency, and invites feedback from the AI community.

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

arXiv cs.CL · 2026-05-13

This paper proposes a covariance-aware variant of Group Relative Policy Optimization (GRPO) that uses Gaussian-kernel advantage reweighting to stabilize training entropy and improve reasoning performance in large language models.

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

Reddit r/MachineLearning · 2026-05-09

DeepSeek released the full V4 paper detailing FP4 quantization-aware training, MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF. The reported efficiency gains are large: V4-Flash uses only 10% of V3.2's FLOPs and 7% of its KV cache at 1M context length.

A new generation of AI models and one of the most powerful research papers out there.

Reddit r/LocalLLaMA · 2026-05-08

Token AI releases a research paper introducing STAM, a new adaptive momentum optimizer designed to improve training stability and reduce memory usage compared to standard optimizers like AdamW.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Hugging Face Daily Papers · 2026-05-07

This paper introduces UniSD, a unified self-distillation framework for adapting large language models that integrates mechanisms for supervision reliability, representation alignment, and training stability. Experimental results show that UniSD improves performance over base models and existing baselines across multiple benchmarks.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Hugging Face Daily Papers · 2026-05-07

This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Hugging Face Daily Papers · 2026-04-14

This paper identifies aggregation bias in GRPO-style reinforcement learning for LLMs and proposes Balanced Aggregation (BA), which computes token-level means separately for the positive and negative subsets to improve training stability and final performance.
