AIPO: Learning to Reason from Active Interaction
Summary
This paper introduces AIPO, a reinforcement learning framework that enhances LLM reasoning by allowing the model to actively consult collaborative agents during exploration to overcome capability boundaries.
# AIPO: Learning to Reason from Active Interaction
Source: [https://arxiv.org/abs/2605.08401](https://arxiv.org/abs/2605.08401)
[View PDF](https://arxiv.org/pdf/2605.08401)
> Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose **AIPO**, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, *Verify Agent*, *Knowledge Agent*, and *Reasoning Agent*, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
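The abstract mentions an importance sampling coefficient and a clipping strategy but does not define them. For orientation only, the sketch below shows the standard PPO-style clipped surrogate for a single token, which is the kind of baseline such a tailored coefficient would presumably modify. It is not AIPO's formulation; the function name `clipped_surrogate` and the default `eps=0.2` are assumptions chosen for illustration.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """Standard PPO-style clipped objective for one token (to be maximized).

    NOTE: generic baseline, not the coefficient proposed in the paper.
    """
    # Importance sampling ratio between the current and the behavior policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to the interval [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, once the ratio exceeds `1 + eps` the clipped term is selected and the objective no longer depends on `logp_new`, so that token contributes zero gradient. This is one generic way a clipped objective can lose its gradient signal; the paper's own analysis of the issue for agent-provided tokens may differ.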
## Submission history
From: Junnan Liu [[view email](https://arxiv.org/show-email/39f1784c/2605.08401)] **[v1]** Fri, 8 May 2026 19:06:55 UTC (1,405 KB)

## Similar Articles
IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents
This paper introduces IAPO, a reinforcement learning algorithm that improves tool-calling capabilities in multimodal small language models by aligning input attribution with a stronger teacher. Experiments on Qwen2.5-VL-3B show an average 3% improvement in visual question answering accuracy across six test sets.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
AIP: A Graph Representation for Learning and Governing Agent Skills
The Agent Instruction Protocol (AIP) proposes modeling AI agent skills as directed execution graphs with schema-validated YAML specifications, replacing free-form prose instructions. Experiments show AIP compilation raised Claude Sonnet's task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks.
CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
CiPO is a novel framework for machine unlearning in Large Reasoning Models that uses iterative preference optimization with counterfactual reasoning traces to selectively remove unwanted knowledge while preserving reasoning abilities. The method addresses the challenge of unlearning in models that rely on chain-of-thought reasoning by generating logically valid alternative reasoning paths during training.
RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents
RICE-PO is a critic-free policy optimization framework that turns retrieval interactions into localized credit signals for training reasoning agents, outperforming prompt-based and group-based RL baselines on BRIGHT and BEIR benchmarks.