This paper introduces TRACE, a framework for turn-aware credit assignment in multi-turn LLM jailbreaking attacks using reinforcement learning, claiming significant improvements in attack success rates and defense alignment.
The paper introduces MemQ, a method that integrates Q-learning into self-evolving memory agents by using eligibility traces over provenance DAGs to solve credit assignment problems in episodic memory retrieval.
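The core idea of propagating credit backward through a provenance DAG with trace-style decay can be sketched as follows. This is a minimal illustration, not MemQ's actual API: the function name, DAG encoding, and the single decay parameter `lam` are all assumptions.

```python
from collections import defaultdict

def assign_credit(dag, reward, leaf, lam=0.9):
    """Propagate a terminal reward backward through a provenance DAG.

    `dag` maps each memory node to the parent memories it was derived
    from; `lam` decays credit per hop, mimicking an eligibility trace.
    Illustrative sketch only, not MemQ's implementation.
    """
    credit = defaultdict(float)
    frontier = {leaf: reward}
    while frontier:
        nxt = defaultdict(float)
        for node, r in frontier.items():
            credit[node] += r
            for parent in dag.get(node, []):
                nxt[parent] += lam * r  # decayed credit flows to ancestors
        frontier = nxt
    return dict(credit)

# Toy provenance DAG: answer <- summary <- {note_a, note_b}
dag = {"answer": ["summary"], "summary": ["note_a", "note_b"]}
print(assign_credit(dag, reward=1.0, leaf="answer", lam=0.5))
# → {'answer': 1.0, 'summary': 0.5, 'note_a': 0.25, 'note_b': 0.25}
```

Ancestor memories that contributed to a rewarded retrieval receive geometrically decayed credit, which is the shape of update a Q-learner over episodic memory would consume.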
This paper introduces Structured Role-Aware Policy Optimization (SRPO), a method that improves multimodal reasoning in Large Vision-Language Models by assigning token-level credit based on distinct perception and reasoning roles within reinforcement learning frameworks.
This paper introduces BEACON, a milestone-guided policy learning framework designed to improve credit assignment and sample efficiency for long-horizon language agents. It demonstrates significant performance improvements over GRPO and GiGPO on benchmarks like ALFWorld, WebShop, and ScienceWorld.
This paper introduces IOP, a framework that internalizes outcome supervision into process supervision for reasoning reinforcement learning, enabling fine-grained credit assignment without external annotations.
This paper introduces AEM, a supervision-free method for agentic reinforcement learning that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs. It demonstrates performance gains on benchmarks like ALFWorld and SWE-bench by aligning uncertainty estimation with action granularity.
This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.
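Turn-level clipping generalizes the standard PPO clipped objective from tokens to whole turns. The sketch below shows that base mechanism under the assumption that per-turn log-probabilities are summed over the turn's tokens; A^2TGPO's adaptive epsilon schedule and information-gain normalization are not reproduced.

```python
import math

def clipped_turn_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate applied at the turn level (a sketch).

    `logp_new`/`logp_old` are the summed log-probs of every token in one
    turn under the current and behavior policies. A fixed `eps` stands in
    for A^2TGPO's adaptive turn-level clipping range.
    """
    ratio = math.exp(logp_new - logp_old)       # turn-level importance ratio
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # clamp to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# A turn whose probability grew 1.5x with positive advantage is clipped at 1.2.
print(clipped_turn_objective(math.log(1.5), 0.0, 1.0, eps=0.2))
# → 1.2
```

Operating on whole turns keeps the update granularity aligned with the agent's decision points, which is the premise behind assigning process credit per turn rather than per token.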
This paper introduces SAVOIR, a framework that applies cooperative game theory and Shapley values to train language agents with improved social intelligence, achieving state-of-the-art results on the SOTOPIA benchmark and matching GPT-4o performance.
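Shapley-value credit assignment distributes a coalition's payoff across agents by averaging each agent's marginal contribution over all join orders. The exact computation, feasible for small agent sets, can be sketched as below; the function names and toy payoff are illustrative, and SAVOIR's actual estimator is not specified here.

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values for a small cooperative game.

    `value` maps a frozenset of players to that coalition's payoff.
    Exponential in len(players); a sampling estimator would be needed
    at scale. Illustrative sketch, not SAVOIR's implementation.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for s in combinations(others, k):
                S = frozenset(s)
                # Weight = probability that coalition S precedes p
                # in a uniformly random ordering of all players.
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S | {p}) - value(S))
        phi[p] = total
    return phi

# Toy game: payoff 1 only if both agents cooperate; credit splits evenly.
v = lambda S: 1.0 if S == frozenset({"a", "b"}) else 0.0
print(shapley(["a", "b"], v))
# → {'a': 0.5, 'b': 0.5}
```

The resulting per-agent values satisfy efficiency (they sum to the grand coalition's payoff), which makes them a principled signal for multi-agent credit assignment.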