AIPO: Learning to Reason from Active Interaction

arXiv cs.CL 05/12/26, 04:00 AM Papers

Summary

This paper introduces AIPO, a reinforcement learning framework that enhances LLM reasoning by allowing the model to actively consult collaborative agents during exploration to overcome capability boundaries.

arXiv:2605.08401v1, Announce Type: new

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, Verify Agent, Knowledge Agent, and Reasoning Agent, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.


Cached at: 05/12/26, 06:43 AM

# AIPO: Learning to Reason from Active Interaction
Source: [https://arxiv.org/abs/2605.08401](https://arxiv.org/abs/2605.08401)
[View PDF](https://arxiv.org/pdf/2605.08401)

> Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose **AIPO**, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, *Verify Agent*, *Knowledge Agent*, and *Reasoning Agent*, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
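
The abstract does not spell out AIPO's tailored importance sampling coefficient or its clipping strategy. As background only, the sketch below shows the standard PPO-style clipped importance-weighted surrogate that RLVR algorithms commonly build on; it is not the paper's formulation. The function names, the `eps` default, and the per-token framing are illustrative assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO-style clipped objective for one token (NOT AIPO's variant).

    logp_new:  log-probability of the token under the current policy
    logp_old:  log-probability under the distribution that produced the token
    advantage: advantage estimate for the token
    eps:       clip range (0.2 is a common default, assumed here)
    """
    # Importance sampling ratio between current and sampling distributions.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic bound: take the smaller of the two candidate objectives.
    return min(ratio * advantage, clipped * advantage)


def mean_objective(logps_new, logps_old, advantages, eps=0.2):
    """Average the per-token clipped surrogate over a sequence."""
    terms = [
        clipped_surrogate(n, o, a, eps)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)
```

In this standard form, a token whose clipped term is selected contributes a constant to the objective and therefore no gradient. That is one general way gradients can vanish when tokens come from a distribution far from the current policy, which is the kind of issue the abstract says AIPO's coefficient and clipping are designed to mitigate.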

## Submission history

From: Junnan Liu [[view email](https://arxiv.org/show-email/39f1784c/2605.08401)] **[v1]** Fri, 8 May 2026 19:06:55 UTC (1,405 KB)

Similar Articles

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Interactive Inverse Reinforcement Learning of Interaction Scenarios via Bi-level Optimization

Learning to model other minds

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Learning to cooperate, compete, and communicate
