trust-region

Trust Region Policy Distillation

Hugging Face Daily Papers · 2026-07-06

Trust Region Policy Distillation (TOP-D) stabilizes on-policy distillation by dynamically constructing a proximal teacher, providing theoretical convergence guarantees and empirical gains in mathematical reasoning tasks with zero additional overhead.

KLip-PPO: A per-sample KL perspective on PPO-Clip

arXiv cs.LG · 2026-06-24

This paper shows that the gradient of the clipped surrogate in Proximal Policy Optimization (PPO) is exactly reproduced by a per-sample Kullback-Leibler penalty with a variable coefficient, revealing structural features of the clipped surrogate and suggesting new design directions.

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

arXiv cs.LG · 2026-06-18

TRIDENT is a novel multi-agent reinforcement learning framework that breaks the coupling between hybrid discrete-continuous actions, hard safety constraints, and physics-governed dynamics, achieving provably safe coordination with a convergence guarantee to a constrained Nash equilibrium and significant reductions in training-time violations.

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Hugging Face Daily Papers · 2026-06-09

This paper introduces CPPO, a method that improves reinforcement learning with verifiable rewards for LLMs by using position-weighted thresholds and cumulative prefix budgeting to address limitations of uniform token-level trust regions.

Rethinking the Divergence Regularization in LLM RL

Hugging Face Daily Papers · 2026-06-08

This paper introduces DRPO, which replaces the hard mask in DPPO with a smooth advantage-weighted quadratic regularizer to improve stability and efficiency in LLM reinforcement learning by providing continuous gradient corrections beyond trust-region boundaries.

Trust Region On-Policy Distillation

Hugging Face Daily Papers · 2026-05-31

The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.

Trust-Region Behavior Blending for On-Policy Distillation

Hugging Face Daily Papers · 2026-05-29

Trust-Region Behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.

Trust Region Q Adjoint Matching

Hugging Face Daily Papers · 2026-05-26

Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies. The method consistently outperforms prior methods on 50 OGBench tasks, achieving a 68% success rate in offline RL compared with 46% for the strongest baseline.

TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

arXiv cs.LG · 2026-05-18

This paper identifies a structural failure mode in sequential fine-tuning of shared-context multi-agent LLM teams, formalized as compounding occupancy shift. It proposes TeamTR, a trust-region framework that resamples trajectories and enforces per-agent divergence control, achieving a 7.1% average improvement over baselines.

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

arXiv cs.LG · 2026-05-13

This paper introduces Trust Region Inverse Reinforcement Learning (TRIRL), a method that combines monotonic dual improvement with efficient local policy updates to outperform state-of-the-art imitation learning methods. It addresses the trade-off between stability and computational cost in IRL by using trust-region constraints.
