Tag
This paper introduces DASH, a method that uses intermediate answer commitments within reasoning traces to assign segment-level credit, reducing overthinking behaviors and improving accuracy on competition-level math benchmarks.
This paper introduces TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning. A structured judge classifies action segments by semantic role and assigns process rewards accordingly, giving more nuanced credit assignment than standard GRPO methods.
This paper proposes Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment (LCA), a framework that jointly learns credit assignment and reward modeling under a weakest-link principle, formulated as a Multiple Instance Learning problem with Softmax-Weighted-Sum pooling. Experiments show it outperforms existing outcome-supervised PRMs across multiple tasks.
TACO introduces a novel credit optimization method for code-tool agents that uses a differential reward probe and outcome-gated advantage routing to distinguish useful from redundant or misleading tool calls, improving multimodal agent performance.
BiPACE is a drop-in advantage estimator that fixes state-action credit mismatch in stepwise group-based RL for LLM agents, using bisimulation-guided state clustering and action counterfactual estimation. It achieves significant performance gains on ALFWorld, WebShop, and TextCraft with Qwen2.5 models.
Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.
GAGPO proposes a critic-free RL method that uses a non-parametric grouped value proxy for step-level credit assignment in multi-turn agentic tasks, outperforming strong baselines on ALFWorld and WebShop.
This paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC) for fine-tuning Vision-Language-Action (VLA) policies using online reinforcement learning with sparse binary episode outcomes. HABC separates viability and efficiency objectives via adaptive critic heads and intervention-aware credit assignment, significantly improving success rates on contact-rich bimanual manipulation tasks.
APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
This paper introduces a learnable channel-class assignment mechanism for forward-only convolutional neural networks, combined with entropy and orthogonality regularization and a loss-aware layer contribution strategy. The method achieves state-of-the-art performance among forward-forward algorithms on CIFAR-10, CIFAR-100, and Tiny-ImageNet, significantly narrowing the gap with backpropagation.
This paper proposes LEAF, a retrospective tree-based reinforcement learning method for speech-aware large language model post-training that improves credit assignment without online branching. LEAF outperforms GRPO on speech question answering and speech translation benchmarks.
PBSD proposes a Bayesian self-distillation method that converts sparse final rewards into calibrated turn-level credit signals for long-horizon agentic tasks, improving policy learning and generalization.
StepPO introduces a step-centric paradigm for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming token-centric methods in multi-turn interaction tasks.
This paper identifies a structural failure mode in token-level credit assignment for LLM reinforcement learning when using LoRA, where intrinsic signals degenerate. It proposes Adapter-Residual Credit Assignment (ARCA), which derives token salience from adapter hidden-state residuals and remains competitive with baselines.
This paper presents SPADER, a reinforcement learning framework for multi-answer QA that uses step-wise peer advantage for credit assignment and diversity-aware exploration rewards to improve recall of long-tail entities, achieving better performance on several benchmarks.
This paper introduces Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment that generalizes to differentiable loss families, including cross-entropy, Bregman divergences, and proper scoring rules. The work provides theoretical grounding for the three-factor learning rule and demonstrates improved performance over existing broadcast approaches on CIFAR-10 and Tiny ImageNet.
DecomposeR introduces a planner-centric reinforcement learning framework that represents research plans as typed DAGs, enabling finer-grained optimization of planning and execution for deep research tasks, achieving 5.1–8.0 point improvements over open baselines.
VeriGate extends GRPO with verifier-gated step-level supervision, providing fine-grained credit assignment when verifier rewards are degenerate. It achieves substantial accuracy improvements on reasoning benchmarks for 1.5B and 7B models.
This paper introduces PRO-CUA, a process-reward optimization framework for training Computer Use Agents (CUAs) using iterative step-level reinforcement learning. The method decouples on-policy environment interaction from policy optimization, enabling dense credit assignment without relying on expert trajectories, and demonstrates effectiveness on live web benchmarks.
This paper introduces Guidance Contrastive Policy Optimization (GCPO), an algorithm that enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts. It consistently outperforms GRPO and DAPO baselines on text-to-image generation and chain-of-thought reasoning benchmarks.
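Several entries above position themselves against GRPO (TRIAGE, LEAF, VeriGate, GCPO). For orientation, the following is a minimal sketch of the group-relative advantage that GRPO-style methods are commonly described as using: each sampled response's scalar reward is standardized against the other responses to the same prompt, and the result is applied uniformly to every token of that response. The function names, the `eps` constant, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from any of the listed papers, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    `rewards` holds one scalar outcome reward per sampled response
    for a single prompt. `eps` (an illustrative assumption) avoids
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Give every token of a response the same sequence-level advantage."""
    return [advantage] * num_tokens


# Hypothetical group of four responses with binary outcome rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)

# Each token of the first response receives an identical credit signal.
token_credit = broadcast_to_tokens(advantages[0], num_tokens=5)

# A group with identical rewards carries no learning signal at all.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because every token inherits the same value, this baseline cannot indicate which step of a response earned the reward, and a group whose rewards are all identical yields zero advantage for every response. The segment-, step-, and token-level methods listed above can be read as different ways of refining this coarse signal.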