AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

arXiv cs.CL Papers

Summary

Proposes AMATA, a multi-agent trajectory alignment framework for knowledge-intensive question answering that introduces intra-trajectory preference learning and inter-agent dependency learning to improve factual grounding and interpretability, outperforming baselines on five benchmarks.

arXiv:2605.17352v1 Announce Type: new

Cached at: 05/19/26, 06:39 AM

# AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering
Source: [https://arxiv.org/html/2605.17352](https://arxiv.org/html/2605.17352)
Taolin Zhang¹, Dongyang Li², Chen Chen⁴, Qizhou Chen³, Jiuheng Wan¹, Xiaofeng He³, Chengyu Wang⁵, Richang Hong¹

¹School of Computer Science and Information Engineering, Hefei University of Technology; ²Shanghai University of Electric Power; ³East China Normal University; ⁴Guangdong University of Finance and Economics; ⁵Alibaba Group

tlzhang@hfut.edu.cn, chengyu.wcy@alibaba-inc.com

###### Abstract

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose *AMATA*, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) *Intra-Trajectory Preference Learning*, which learns objective-oriented preferences to prioritize critical agents, and (2) *Inter-Agent Dependency Learning*, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that *AMATA* consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

C. Wang and R. Hong are co-corresponding authors.

## 1 Introduction

Large language models (LLMs) serve as the backbone of modern NLP infrastructure Hu et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib8)), yet they face persistent reliability challenges. Chief among these are hallucinations that appear superficially plausible Woo et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib4)) and other undesirable behaviors Yang et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib3)).

Retrieval-Augmented Generation (RAG) provides a mitigation strategy by enabling LLMs to dynamically retrieve up-to-date information from external knowledge sources during inference Rubin et al. ([2022](https://arxiv.org/html/2605.17352#bib.bib112)); Xu et al. ([2024b](https://arxiv.org/html/2605.17352#bib.bib246)). However, RAG incurs its own drawbacks, such as retrieval inaccuracies Zhu et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib245)) and increased inference latency due to longer contexts Zou et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib244)). As a result, research has increasingly shifted towards multi-agent systems that leverage diverse tooling and cooperative reflection mechanisms to enhance task robustness Xu et al. ([2024a](https://arxiv.org/html/2605.17352#bib.bib202)); Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)).

![Refer to caption](https://arxiv.org/html/2605.17352v1/x1.png)

Figure 1: Comparison of multi-agent paradigms for knowledge-intensive QA. *Modular Training* and *End-to-End Training* focus, respectively, on local agent reasoning and global co-adaptation, while the *Global-Local Training* paradigm combines the advantages of both. Our *AMATA* framework dynamically learns intra-trajectory relationships between questions and LLM agents, and establishes inter-agent dependencies across the agent ensemble.

Prior multi-agent approaches for knowledge-intensive QA can be broadly categorized into three paradigms: *Modular Training*, *End-to-End Training*, and *Global-Local Training*. (1) *Modular Training* fine-tunes individual agents on customized datasets targeted to their respective subtask capabilities Long et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib191)); Koopman et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib192)). These locally optimized agents are then manually integrated through static workflows, with execution sequences and handoff mechanisms rigidly defined. The lack of global optimization leads to error propagation and diminished overall performance, as locally optimal agents cannot compensate for inter-agent dependencies or system-wide dynamics. (2) *End-to-End Training* adopts a unified optimization strategy, jointly training all agents within a single framework using task-level supervision Zong et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib188)); Klisura et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib189)). While backpropagation updates parameters throughout the agent ensemble, enabling co-adaptation and implicit coordination, uniform gradient updates obscure the distinct specialization demands of heterogeneous agents. Agents responsible for different subtasks Khot et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib186)); Yue et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib187)) require individualized learning signals; parameter sharing can cause over-homogenization, thereby eroding specialized expertise. (3) *Global-Local Training* employs a two-stage process: first, agents are optimized independently for subtask proficiency, then fine-tuned jointly to align global behaviors Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)). Although this strategy combines localized specialization with global coordination, it often fails to capture dynamic inter-agent dependencies, which are crucial for handling diverse knowledge-intensive tasks Zhang et al. ([2025a](https://arxiv.org/html/2605.17352#bib.bib204), [b](https://arxiv.org/html/2605.17352#bib.bib219)). As shown in Figure [1](https://arxiv.org/html/2605.17352#S1.F1), for questions with high confidence, adding a “Verifier” agent after the preceding five agents may be unnecessary. Additionally, the three knowledge agents (“Retriever”, “Filter”, and “Locator”) exhibit strong interdependence; when the “Retriever” is triggered, subsequent actions of the other two agents must also be executed.

In this paper, we propose Adaptive Multi-Agent Trajectory Alignment (*AMATA*), a framework designed to improve agent-level alignment and capture dynamic inter-agent dependencies. AMATA maintains high reasoning performance while significantly reducing token overhead during inference. Our main contributions are summarized below:

**Intra-Trajectory Preference Learning.** Existing approaches commonly treat all agents as uniformly relevant throughout the reasoning process. In contrast, our method dynamically optimizes agent participation for each question, learning question-specific agent preference distributions that adaptively modulate each agent’s influence based on utility for the current input. For each agent and question pair, we assign a preference score (e.g., *\<Reconstructor: 5\>* versus *\<Verifier: 1\>*), concatenated with the agent description as a prefix and paired with the question for fine-tuning the corresponding agent.

**Inter-Agent Dependency Learning.** Multi-agent systems exhibit context-dependent inter-agent dependencies, where triggering a pivotal agent necessitates coordinated execution of functionally linked agents, while allowing conditional suppression of unrelated agents Ji and Gao ([2024](https://arxiv.org/html/2605.17352#bib.bib236)); Gao et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib237)). We introduce a dependency-aware Direct Preference Optimization (DA-DPO) module that learns context-sensitive execution ranking. Specifically, for each question, we construct preference samples that explicitly encode inter-agent dependencies and annotate each trajectory with a joint preference score reflecting the global optimality of the multi-agent sequence. These scores induce a dependency-aware ranking over sampled trajectories, prioritizing those with robust inter-agent coordination for DA-DPO training. This mechanism enables LLMs to infer optimal multi-agent execution sequences with high reliability.

We evaluate *AMATA* against competitive baselines on five benchmarks: HealthQA Akhtar et al. ([2022](https://arxiv.org/html/2605.17352#bib.bib227)), ARC-Choice Clark et al. ([2018](https://arxiv.org/html/2605.17352#bib.bib226)), PopQA Mallen et al. ([2022](https://arxiv.org/html/2605.17352#bib.bib225)), SQuAD 1.1 Rajpurkar et al. ([2016](https://arxiv.org/html/2605.17352#bib.bib1)), and ASQA Gao et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib224)). Our framework achieves an average performance improvement of +4.02% across all tasks, and reduces token consumption overhead by approximately 70% compared to strong baselines.

## 2 Related Work

**Multi-Agent Trajectory Learning.** Multi-agent trajectory learning refers to the process of orchestrating multiple agents to collectively solve complex tasks Li ([2025](https://arxiv.org/html/2605.17352#bib.bib200)). Existing literature can be grouped into three main paradigms: (1) *Modular Training* Long et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib191)); Koopman et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib192)) trains agents independently for their respective subtasks. This often results in suboptimal global performance due to a lack of system-wide coordination. Preference learning has been introduced to ameliorate poor decision-making at the individual agent level Song et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib199)); Xiong et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib198)), but overall integration remains a challenge. (2) *End-to-End Training* Zong et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib188)); Klisura et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib189)) jointly optimizes all agents using unified loss functions derived from expert trajectories curated by a teacher LLM (e.g., FireAct Chen et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib197)), AgentTuning Zeng et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib196))). Other approaches such as MapGPT Chen et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib216)) and LLM-A\* Meng et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib217)) focus on providing agents with a global view of the environment. Despite improved coordination, these methods may obscure contributions from specialized agents, potentially hindering the balance between individual expertise and system-wide collaboration Zhang et al. ([2025b](https://arxiv.org/html/2605.17352#bib.bib219)). (3) *Global-Local Training* combines both global context and local adaptation signals to enhance agent specialization. For instance, CoAct Hou et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib218)) emulates hierarchical human planning in LLMs, while SMART Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)) leverages multi-granular trajectories for agent control and system synergy. These frameworks inject both global and local signals into agent optimization, aiming to preserve both broad task alignment and agent-level differentiation Subramonian et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib215)). However, these methods often overlook inter-agent dependencies.

**Knowledge Enhancement for LLMs.** LLMs are prone to hallucinations and lack coverage of long-tail knowledge due to their parametric nature Ji et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib211)); Li et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib268)); Huang et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib212)). To ensure factual accuracy, LLMs frequently rely on external sources. RAG incorporates non-parametric resources to improve factual reliability and enrich LLM outputs Fan et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib210)); Singh et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib207)). Advancements in this area include better retrieval mechanisms using dense retrievers Karpukhin et al. ([2020](https://arxiv.org/html/2605.17352#bib.bib160)); Ye et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib209)) and improved information integration Zhang et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib269)); Wang et al. ([2025b](https://arxiv.org/html/2605.17352#bib.bib206)); Cheng et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib205)). For example, Self-RAG Asai et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib248)) introduces reflection tokens to assess both retrieval and response quality during inference. However, these RAG techniques typically operate within a single-agent paradigm, executing retrieval and generation in a sequential pipeline Singh et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib207)), thus failing to exploit the emergent capabilities and cooperative reasoning potential of multi-agent LLM frameworks Qian et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib263)). In contrast, *AMATA* is designed for knowledge-intensive QA tasks in a multi-agent setting.

## 3 Methodology

| Agent | Head | End |
| --- | --- | --- |
| Intent Reconstructor $\mathcal{A}_{\text{IR}}$ | $\langle\text{Reconstructor}\rangle$ | $\langle\text{/eoi}\rangle$ |
| Knowledge Retriever $\mathcal{A}_{\text{KR}}$ | $\langle\text{Retriever}\rangle$ | $\langle\text{/eor}\rangle$ |
| Knowledge Filter $\mathcal{A}_{\text{KF}}$ | $\langle\text{Filter}\rangle$ | $\langle\text{/eof}\rangle$ |
| Knowledge Locator $\mathcal{A}_{\text{KL}}$ | $\langle\text{Locator}\rangle$ | $\langle\text{/eol}\rangle$ |
| Response Generator $\mathcal{A}_{\text{RG}}$ | $\langle\text{Generator}\rangle$ | $\langle\text{/eog}\rangle$ |
| Answer Verifier $\mathcal{A}_{\text{AV}}$ | $\langle\text{Verifier}\rangle$ | $\langle\text{/eov}\rangle$ |

Table 1: Agents and special tokens used in trajectories. Detailed agent descriptions and the trajectory data collection process are provided in Appendices [A.1](https://arxiv.org/html/2605.17352#A1.SS1) and [A.4](https://arxiv.org/html/2605.17352#A1.SS4).

![Refer to caption](https://arxiv.org/html/2605.17352v1/x2.png)

Figure 2: Comparison between *AMATA* and standard SFT and DPO pipelines. *AMATA* optimizes intra-trajectory preferences and inter-agent dependencies through adaptive prefix scoring (left) and DA-DPO (right).

### 3.1 Task Formulation and Basic Notations

Figure [2](https://arxiv.org/html/2605.17352#S3.F2) illustrates the overall architecture of *AMATA*. Given a question $\mathcal{Q}$, we design a workflow utilizing an LLM-based multi-agent system to generate the answer $\mathcal{Y}$, where $\mathcal{Y}=\mathcal{F}(\mathcal{Q},\mathbf{A})$. Here, $\mathcal{F}$ denotes the entire system parameterized by learnable weights, and $\mathbf{A}$ is the set of agents, such as the six agents in *AMATA* (see Table [1](https://arxiv.org/html/2605.17352#S3.T1)). In this framework, each agent $\mathcal{A}\in\mathbf{A}$ receives the current state and produces three outputs: a response $y_i$, a special end token $e_i$, and a special head token $h_{i+1}$ for the subsequent agent, expressed as:

$$
y_i, e_i, h_{i+1} = \mathcal{A}(\mathcal{Q}, y_{i-1}, e_{i-1}, h_i), \tag{1}
$$

where $\mathcal{T}=\{(h_1,y_1,e_1),\dots,(h_T,y_T,e_T)\}$ denotes a complete trajectory realized by dynamically executing the workflow $\mathcal{F}$. The final output $\mathcal{Y}$ is obtained after completing this trajectory. In LLM-based multi-agent systems, conditional autoregressive language modeling is typically adopted to learn which agent should act and when, utilizing these special tokens to coordinate agent behaviors Kwon et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib240)); Tang et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib241)); Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)). The trajectory-wise objective function is defined as $\mathcal{L}(\mathcal{T})=\sum_{i=1}^{T}-\log\Pr\left(t_i\mid t_{<i},\mathcal{Q}\right)$, where $t_i=(h_i,y_i,e_i)$ represents the $i$-th tuple in the trajectory $\mathcal{T}$ and $t_{<i}$ comprises all preceding tuples in the trajectory.
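To make the notation concrete, the following is a minimal sketch of how a trajectory of $(h_i, y_i, e_i)$ tuples could be serialized and scored. The ASCII spellings of the special tokens, the `Step` class, and the function names are illustrative assumptions, not the authors' implementation; the per-step log-probabilities are assumed to come from some language model.

```python
from dataclasses import dataclass

# Head/end special tokens from Table 1, written as plain ASCII strings.
# The exact string spelling is an assumption made for this sketch.
AGENT_TOKENS = {
    "Reconstructor": ("<Reconstructor>", "</eoi>"),
    "Retriever": ("<Retriever>", "</eor>"),
    "Filter": ("<Filter>", "</eof>"),
    "Locator": ("<Locator>", "</eol>"),
    "Generator": ("<Generator>", "</eog>"),
    "Verifier": ("<Verifier>", "</eov>"),
}


@dataclass
class Step:
    """One tuple t_i = (h_i, y_i, e_i); the tokens are looked up from the agent name."""
    agent: str
    response: str


def serialize(steps):
    """Flatten a trajectory into head-token + response + end-token segments."""
    parts = []
    for step in steps:
        head, end = AGENT_TOKENS[step.agent]
        parts.append(f"{head}{step.response}{end}")
    return "".join(parts)


def trajectory_nll(step_log_probs):
    """L(T) = sum_i -log Pr(t_i | t_<i, Q), given one log-probability per step."""
    return -sum(step_log_probs)


if __name__ == "__main__":
    traj = [Step("Reconstructor", "Who directed the film?"), Step("Generator", "Answer text")]
    print(serialize(traj))
    print(trajectory_nll([-0.5, -1.5]))
```

A short trajectory that skips the knowledge agents, as in the example above, is exactly the kind of sequence the later sections aim to let the model choose for simple questions.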

### 3\.2Intra\-Trajectory Preference Learning

Agents in a multi-agent system exhibit heterogeneous capabilities, necessitating autonomous tool usage and adaptive coordination. For example, for simple questions, the workflow may not require external knowledge retrieval or output verification (e.g., $\mathcal{A}_{\text{AV}}$). In such cases, the workflow only formalizes the question via $\mathcal{A}_{\text{IR}}$ and uses the generator to produce an answer via $\mathcal{A}_{\text{RG}}$. To model agent-specific tool usage within a trajectory, a common approach is supervised fine-tuning (SFT), which enhances the tool-handling skills of individual agents:

$$
\mathcal{L}_{\mathrm{SFT}}^{(j)}(\Theta) = -\mathbb{E}_{(\mathcal{Q},\mathcal{Y})\sim\mathcal{D}_{\text{intra}}^{(j)}} \log\Pr(\mathcal{Y}\mid\mathcal{Q};\Theta), \tag{2}
$$

where $\mathcal{D}_{\text{intra}}^{(j)}$ is the subset of the intra-trajectory dataset $\mathcal{D}_{\text{intra}}$ corresponding to the $j$-th agent, and $\Theta$ denotes intra-trajectory model parameters. The agent’s description and functionalities are incorporated into $\mathcal{Q}$ via prompt engineering.

While agent\-specific training can be effective, it requires both agent\-specific datasets and significant computational resources, limiting scalability\. To address this, we consolidate preference learning within a unified trajectory sampling framework, enabling a monolithic model to capture heterogeneous agent competencies through SFT augmented with adaptive prefix scoring:

$$
\mathcal{L}_{\mathrm{Intra}}(\Theta) = -\mathbb{E}_{(\mathcal{Q},\mathcal{Y},\mathcal{P})\sim\mathcal{D}_{\text{intra}}} \log\Pr(\mathcal{Y}\mid\mathcal{P},\mathcal{Q};\Theta) \tag{3}
$$

where $\mathcal{P}=\{P_{\mathcal{A}_{\text{IR}}},\dots,P_{\mathcal{A}_{\text{AV}}}\}$ represents the preference prefix set for each agent in the trajectory, indicating their relative importance for a given sample (e.g., *\<Retriever: 5\>*). These scores reflect the importance of each agent in correctly answering the question, as annotated by LLMs. Prompt templates are provided in Appendix [A.1](https://arxiv.org/html/2605.17352#A1.SS1).
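As an illustration of adaptive prefix scoring, the sketch below builds a preference prefix $\mathcal{P}$ and prepends it to a question to form one SFT input. The tag format, the 1 to 5 score range (inferred only from the examples *\<Reconstructor: 5\>* and *\<Verifier: 1\>*), the agent ordering, and the function names are all assumptions; the paper's actual templates are in its Appendix A.1.

```python
# Agent order follows Table 1; using it for the prefix is an assumption.
AGENT_ORDER = ["Reconstructor", "Retriever", "Filter", "Locator", "Generator", "Verifier"]


def build_prefix(scores, low=1, high=5):
    """Render per-agent preference scores as '<Agent: score>' tags.

    `scores` maps agent name -> integer score. The [low, high] range is a
    hypothetical choice for this sketch.
    """
    tags = []
    for agent in AGENT_ORDER:
        score = scores[agent]
        if not low <= score <= high:
            raise ValueError(f"score for {agent} out of range: {score}")
        tags.append(f"<{agent}: {score}>")
    return " ".join(tags)


def build_sft_input(question, scores):
    """Concatenate the preference prefix P with the question Q."""
    return f"{build_prefix(scores)}\n{question}"


if __name__ == "__main__":
    scores = {
        "Reconstructor": 5, "Retriever": 5, "Filter": 4,
        "Locator": 4, "Generator": 5, "Verifier": 1,
    }
    print(build_sft_input("Which studio released the film?", scores))
```

Because the prefix is ordinary text, a single model can be conditioned on it, which is what allows one monolithic model to replace per-agent fine-tuning in Eq. (3).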

Consider the intra-trajectory sample illustrated in Fig. [3](https://arxiv.org/html/2605.17352#S3.F3). The reference to “Revenge of the Nerds” requires background knowledge to support the LLM’s response. Accordingly, the $\langle\text{Retriever}\rangle$, $\langle\text{Filter}\rangle$, and $\langle\text{Locator}\rangle$ agents receive higher preference scores, reflecting a stronger need for knowledge tools. When knowledge agents yield high-confidence outputs, verification becomes redundant, resulting in a low preference score for the $\langle\text{Verifier}\rangle$ agent. This example demonstrates that question complexity induces a dynamic, heterogeneous agent hierarchy within the reasoning trajectory, and underscores the need for a framework that supports flexible, token-efficient tool orchestration.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x3.png)

Figure 3: Two-stage training examples of *AMATA*. Detailed annotation process and robustness verification for the *DEPENDENCY SCORES* are provided in Appendix [A.1.2](https://arxiv.org/html/2605.17352#A1.SS1.SSS2).

### 3.3 Inter-Agent Dependency Learning

Functional dependencies naturally exist between agents within a trajectory, as downstream agents rely on responses and states generated by upstream agents in order to be triggered and operate effectively Gao et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib237)). In *AMATA*, such dependencies are especially pronounced: for instance, if the retrieval agent $\mathcal{A}_{\text{KR}}$ is activated, then the knowledge filtering agent $\mathcal{A}_{\text{KF}}$ and the locating agent $\mathcal{A}_{\text{KL}}$ must also be executed. To model these interactions, we introduce inter-agent dependency learning, leveraging pairs of winning and losing samples for Direct Preference Optimization (DPO) Rafailov et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib220)). This enables the model to automatically discover optimal patterns of agent collaboration and to capture the underlying dependency structures among agents.
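The Retriever, Filter, Locator dependency described above can be expressed as a simple structural check on an agent sequence. This is only a hand-written sketch of the constraint for intuition; in *AMATA* the dependency is learned from preference data rather than hard-coded, and requiring the order Filter before Locator is an assumption of this sketch.

```python
def respects_knowledge_dependency(agents):
    """Return True if the sequence obeys the Retriever trigger rule.

    If "Retriever" appears, "Filter" and then "Locator" must both appear
    after it (in that order, by assumption). Sequences without "Retriever"
    pass trivially; no other constraint is checked.
    """
    if "Retriever" not in agents:
        return True
    pos = agents.index("Retriever")
    for name in ("Filter", "Locator"):
        try:
            # Search only to the right of the previously matched agent.
            pos = agents.index(name, pos + 1)
        except ValueError:
            return False
    return True


if __name__ == "__main__":
    full = ["Reconstructor", "Retriever", "Filter", "Locator", "Generator", "Verifier"]
    short = ["Reconstructor", "Generator"]
    broken = ["Reconstructor", "Retriever", "Generator"]
    for seq in (full, short, broken):
        print(seq, respects_knowledge_dependency(seq))
```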

A widely adopted technique combines the Bradley–Terry (BT) model Bradley and Terry ([1952](https://arxiv.org/html/2605.17352#bib.bib234)) with DPO to parameterize the reward function for trajectory selection. Formally, the probability that the winning response $y_w$ is preferred over the losing response $y_l$ for a given instruction $\mathcal{Q}$ is:

$$
\Pr(y_w \succ y_l \mid \mathcal{Q}) = \frac{\exp(r(\mathcal{Q}, y_w))}{\exp(r(\mathcal{Q}, y_w)) + \exp(r(\mathcal{Q}, y_l))} \tag{4}
$$

where the reward $r(\mathcal{Q},y)$ measures the policy model’s preference for $y$ and is given by $r(\mathcal{Q},y)=\beta\cdot\log\frac{\pi_{\tilde{\Theta}}(y\mid\mathcal{Q})}{\pi_{\mathrm{ref}}(y\mid\mathcal{Q})}+\beta\cdot\log\mathrm{Z}(\mathcal{Q})$, with $\pi_{\mathrm{ref}}$ and $\pi_{\tilde{\Theta}}$ denoting the intra-trajectory and inter-agent dependency models, respectively. The coefficient $\beta$ modulates the strength of regularization, and $\mathrm{Z}(\mathcal{Q})=\sum_{y}\pi_{\mathrm{ref}}(y\mid\mathcal{Q})\exp\left(\frac{1}{\beta}r(\mathcal{Q},y)\right)$ denotes the partition function.
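A small numerical sketch may help here. Under the Bradley–Terry form of Eq. (4), the preference probability depends only on the reward difference, so the $\beta\log\mathrm{Z}(\mathcal{Q})$ term, which is shared by both responses to the same question, cancels. The function names and the example log-probabilities below are illustrative assumptions.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta):
    """beta * log(pi_policy(y|Q) / pi_ref(y|Q)), i.e. the reward without the Z(Q) term."""
    return beta * (logp_policy - logp_ref)


def bt_preference(r_w, r_l):
    """Eq. (4): exp(r_w) / (exp(r_w) + exp(r_l)), written as a sigmoid of r_w - r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


if __name__ == "__main__":
    beta = 0.1
    # Hypothetical sequence log-probabilities under the policy and the reference.
    r_w = implicit_reward(logp_policy=-12.0, logp_ref=-14.0, beta=beta)
    r_l = implicit_reward(logp_policy=-15.0, logp_ref=-13.0, beta=beta)
    p = bt_preference(r_w, r_l)

    # Adding the same constant (standing in for beta * log Z(Q)) to both
    # rewards leaves the preference probability unchanged.
    shift = 3.7
    p_shifted = bt_preference(r_w + shift, r_l + shift)
    print(p, p_shifted)
```

This cancellation is why DPO-style objectives can be trained from policy and reference log-probabilities alone, without estimating the partition function.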

However, merely distinguishing between winning and losing samples is insufficient for optimal dependency modeling. Our analysis (see Sect. [4.3](https://arxiv.org/html/2605.17352#S4.SS3)) reveals that specific combinations of agent preference scores in the trajectory prefix are correlated with higher response quality, compared to settings that neglect agent dependency information. For example, in the inter-trajectory samples presented in Fig. [3](https://arxiv.org/html/2605.17352#S3.F3), although both $y_w^1$ and $y_w^2$ yield correct responses, $y_w^1$ should receive a higher trajectory policy preference due to its consistently elevated scores for tightly coupled agents (such as $\langle\text{Retriever}\rangle$, $\langle\text{Filter}\rangle$, and $\langle\text{Locator}\rangle$). Moreover, both of these trajectories outperform all losing samples. To address this, we propose the *dependency-aware DPO* (DA-DPO) algorithm in *AMATA*, guiding the policy model to better capture relative dependency relationships among agent preference scores.

| Model | HealthQA (Acc.) | ARC-C (Acc.) | PopQA (Acc.) | SQuAD 1.1 (Acc.) | ASQA (Str\_EM) | ASQA (Rouge-L) | ASQA (Mauve) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Vanilla QA Methods* | | | | | | | | |
| Alpaca2 7B | 44.78 (±1.2) | 36.43 (±1.5) | 25.58 (±0.8) | 11.50 (±1.1) | 14.42 (±1.6) | 28.72 (±2.1) | 51.24 (±0.9) | 30.38 (±1.1) |
| Mistral-Instruct 7B | 65.45 (±1.4) | 57.84 (±0.7) | 22.37 (±1.3) | 14.97 (±1.3) | 20.80 (±2.2) | 32.20 (±0.9) | 33.47 (±1.8) | 35.30 (±0.8) |
| Llama-2-Chat 7B | 47.95 (±1.9) | 47.95 (±1.1) | 25.44 (±0.8) | 14.13 (±0.7) | 16.79 (±2.3) | 32.35 (±1.6) | 24.21 (±1.2) | 29.83 (±1.5) |
| Vicuna-v1.5 13B | 63.01 (±2.0) | 57.59 (±0.9) | 17.94 (±1.5) | 15.25 (±1.8) | 31.95 (±2.2) | 22.99 (±1.7) | 68.41 (±1.4) | 39.59 (±1.3) |
| Llama-2-Chat 13B | 62.20 (±1.8) | 48.72 (±2.2) | 21.22 (±1.9) | 15.97 (±1.4) | 19.97 (±0.7) | 30.37 (±1.3) | 40.23 (±1.5) | 34.10 (±1.7) |
| Qwen-2.5-Ins. 7B | 64.02 (±1.1) | 51.38 (±1.7) | 22.35 (±0.7) | 17.23 (±1.2) | 18.99 (±1.3) | 31.65 (±2.0) | 47.03 (±1.1) | 36.09 (±1.2) |
| GPT-3.5-turbo | 76.08 (±1.5) | 77.30 (±0.8) | 29.30 (±1.2) | 22.90 (±1.6) | 39.94 (±1.4) | 35.73 (±0.7) | 44.63 (±2.3) | 46.55 (±1.4) |
| *Knowledge-augmented Methods* | | | | | | | | |
| Alpaca2 7B | 26.44 (±1.7) | 35.15 (±1.4) | 33.38 (±1.6) | 21.41 (±2.2) | 23.59 (±0.8) | 27.21 (±2.3) | 50.09 (±1.5) | 31.04 (±1.2) |
| REPLUG 7B | 41.72 (±1.3) | 47.26 (±0.8) | 37.24 (±1.5) | 24.23 (±1.7) | 26.54 (±2.2) | 33.25 (±0.9) | 54.03 (±1.4) | 37.75 (±1.7) |
| VANILLA 7B\* | 29.52 (±1.6) | 42.74 (±2.2) | 37.52 (±1.7) | 25.92 (±1.4) | 32.25 (±1.5) | 34.93 (±0.8) | 39.54 (±2.3) | 34.63 (±1.1) |
| RADIT 7B | 52.98 (±1.5) | 62.10 (±1.7) | 38.02 (±2.3) | 23.86 (±0.8) | 25.68 (±1.4) | 15.99 (±1.6) | 12.35 (±1.9) | 33.00 (±1.2) |
| INTERACT 7B | 65.45 (±0.8) | 48.12 (±1.3) | 41.31 (±1.6) | 31.52 (±1.4) | 34.54 (±1.7) | 35.51 (±2.2) | 43.45 (±1.5) | 42.84 (±0.9) |
| SelfRag 7B\* | 68.99 (±1.4) | 65.52 (±0.6) | 40.67 (±0.8) | 22.39 (±1.3) | 28.68 (±1.5) | 34.11 (±1.7) | 83.00 (±2.1) | 49.05 (±1.8) |
| *LLM-based Trajectory Methods* | | | | | | | | |
| MMAgent 7B | 72.56 (±0.7) | 64.43 (±1.3) | 37.92 (±1.5) | 24.62 (±0.8) | 34.13 (±1.6) | 37.25 (±1.2) | 90.11 (±2.0) | 51.57 (±1.3) |
| SMART 7B | 73.90 (±1.6) | 67.31 (±1.4) | 42.88 (±0.8) | 29.24 (±1.3) | 42.56 (±1.7) | 41.71 (±1.5) | 92.32 (±2.2) | 55.70 (±2.1) |
| SPA-RL 7B† | 73.23 (±1.2) | 68.53 (±0.7) | 42.72 (±2.1) | 29.46 (±1.6) | 43.73 (±0.8) | 41.37 (±1.3) | 91.02 (±1.5) | 55.72 (±2.6) |
| GiGPO 7B† | 73.97 (±1.5) | 68.01 (±1.8) | 43.52 (±1.3) | 29.97 (±1.7) | 43.88 (±1.4) | 43.64 (±1.1) | 92.83 (±1.2) | 56.55 (±2.1) |
| AMATA 7B | 75.83 (±0.8) | 72.47 (±1.6) | 47.39 (±1.2) | 34.61 (±1.5) | 49.10 (±1.7) | 48.26 (±1.3) | 96.35 (±1.4) | 60.57 (±1.1) |

Table 2: Overall results of *AMATA*. Results of GPT-3.5-turbo are for reference only. \* indicates re-implemented methods based on the same model. † denotes the RL settings described in Appendix [A.3.1](https://arxiv.org/html/2605.17352#A1.SS3.SSS1). Results for other LLMs are shown in Appendix [B.3](https://arxiv.org/html/2605.17352#A2.SS3). Bold numbers represent the best results, while underlined indicate the second-best.

Specifically, given $M$ winning and $N$ losing samples for a question $\mathcal{Q}$, we first select the top-$K$ winning samples based on their *dependency scores*, which measure the “goodness” of dependency among agent preference scores as determined by prior knowledge. We treat the remaining $M-K$ winning samples as losing ones due to their lower “goodness” of dependencies. These samples are denoted as $(y_w^1,\dots,y_w^K,y_w^{K+1},\dots,y_w^M)$ and $(y_l^{M+1},\dots,y_l^{M+N})$, where $(y_w^{K+1},\dots,y_w^M)$ and $(y_l^{M+1},\dots,y_l^{M+N})$ are treated as losing samples. Inspired by listwise Plackett-Luce preference modeling Plackett ([1975](https://arxiv.org/html/2605.17352#bib.bib233)), we define the inter-agent dependency model as follows:

$$
\begin{aligned}
&\Pr\left(y_w^1 \succ y_w^2 \succ \cdots \succ y_w^K \succ \{y_w^{K+1},\dots,y_l^{M+N}\} \mid \mathcal{Q}\right) \\
&\quad= \sum_{f_{K+1}^{M+N}} \prod_{i=1}^{M+N-1} f_{\mathcal{E}}(\mathcal{Q}, y_i) \\
&\quad= \prod_{i=1}^{K} f_{\mathcal{E}}(\mathcal{Q}, y_i) \cdot \sum_{f_{K+1}^{M+N}} \prod_{i=K+1}^{M+N-1} f_{\mathcal{E}}(\mathcal{Q}, y_i) \\
&\quad= \prod_{i=1}^{K} f_{\mathcal{E}}(\mathcal{Q}, y_i) \cdot \sum_{f_{K+1}^{M+N}} \Pr(y_{K+1} \succ \cdots \succ y_{M+N} \mid \mathcal{Q}) \\
&\quad= \prod_{i=1}^{K} f_{\mathcal{E}}(\mathcal{Q}, y_i)
\end{aligned} \tag{5}
$$

where $f_{K+1}^{M+N}$ denotes the set of all permutations of $(y_{K+1},\ldots,y_{M+N})$ and $f_{\mathcal{E}}(\mathcal{Q},y_i)=\frac{\exp(r(\mathcal{Q},y_i))}{\sum_{j=i}^{M+N}\exp(r(\mathcal{Q},y_j))}$. The last equality holds because the probabilities of all orderings of the rejected samples sum to one. The set $\{y_w^{K+1},\dots,y_l^{M+N}\}$ denotes the rejected trajectory set for inter-agent dependency learning, including $(M-K)$ original winning samples and $N$ losing samples.
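The selection step and the marginalization in Eq. (5) can be checked numerically. The sketch below first splits samples into chosen and rejected sets by dependency score, then verifies on toy rewards that summing the Plackett-Luce probability over every ordering of the rejected set gives the closed-form product of the first $K$ factors. Function names, the `(trajectory, score)` pair layout, and the reward values are assumptions for illustration only.

```python
import math
from itertools import permutations


def split_by_dependency(winners, losers, k):
    """Keep the top-k winners by dependency score; demote the rest to rejected.

    `winners` is a list of (trajectory, dependency_score) pairs and `losers`
    a list of trajectories. Returns (chosen, rejected), chosen best first.
    """
    ranked = sorted(winners, key=lambda pair: pair[1], reverse=True)
    chosen = [traj for traj, _ in ranked[:k]]
    rejected = [traj for traj, _ in ranked[k:]] + list(losers)
    return chosen, rejected


def pl_factor(rewards, i):
    """f_E(Q, y_i) = exp(r_i) / sum_{j >= i} exp(r_j)."""
    return math.exp(rewards[i]) / sum(math.exp(r) for r in rewards[i:])


def top_k_prob(rewards, k):
    """Closed form of Eq. (5): product of the first k Plackett-Luce factors."""
    prob = 1.0
    for i in range(k):
        prob *= pl_factor(rewards, i)
    return prob


def top_k_prob_by_enumeration(rewards, k):
    """Sum the full-ordering probability over all permutations of the rejected tail."""
    head, tail = list(rewards[:k]), list(rewards[k:])
    total = 0.0
    for perm in permutations(tail):
        ordering = head + list(perm)
        total += top_k_prob(ordering, len(ordering))
    return total


if __name__ == "__main__":
    chosen, rejected = split_by_dependency(
        winners=[("t1", 0.9), ("t2", 0.4), ("t3", 0.7)], losers=["t4"], k=2
    )
    print(chosen, rejected)
    rewards = [1.2, 0.7, 0.1, -0.3, -1.0]  # first two entries play the chosen role
    print(top_k_prob(rewards, 2), top_k_prob_by_enumeration(rewards, 2))
```

The practical consequence is that the internal order of the rejected samples never needs to be annotated: only the ranking of the top-$K$ chosen trajectories enters the objective.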

By substituting the reward from Eq. (7) into the probability maximization objective $\Pr(y_w^1\succ\cdots\succ y_w^K\succ\{y_w^{K+1},\dots,y_l^{M+N}\}\mid\mathcal{Q})$, we obtain the objective of the inter-agent dependency loss:

$$
\mathcal{L}_{\text{Inter}}(\tilde{\Theta}) = -\mathbb{E}_{(\mathcal{Q},\, y_w^1,\ldots,y_l^L)\sim\mathcal{D}^{\text{inter}}}\, \mathcal{H}(\mathcal{Q},y)
\tag{6}
$$

$$
\mathcal{H}(\mathcal{Q},y) = \sum_{i=1}^{K} \log\sigma\left(-\log\sum_{j=i+1}^{M+N}\exp\mathcal{V}_{\beta}\right)
\tag{7}
$$

where $\mathcal{V}_{\beta}=\beta\log\frac{\pi_{\tilde{\Theta}}(y_j\mid\mathcal{Q})}{\pi_{\mathrm{ref}}(y_j\mid\mathcal{Q})}-\beta\log\frac{\pi_{\tilde{\Theta}}(y_i\mid\mathcal{Q})}{\pi_{\mathrm{ref}}(y_i\mid\mathcal{Q})}$, and $\mathcal{D}^{\text{inter}}$ and $\tilde{\Theta}$ denote the inter-agent training samples and parameters, respectively.
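The step from Eq. (5) to Eq. (7) can be spelled out as follows. This is a sketch that assumes the standard DPO reward parameterization of Rafailov et al. (2023), $r(\mathcal{Q},y)=\beta\log\frac{\pi_{\tilde{\Theta}}(y\mid\mathcal{Q})}{\pi_{\mathrm{ref}}(y\mid\mathcal{Q})}+\beta\log Z(\mathcal{Q})$; the paper's own reward equation lies outside this section, so its exact form here is an assumption.

```latex
\begin{aligned}
f_{\mathcal{E}}(\mathcal{Q}, y_i)
  &= \frac{\exp r(\mathcal{Q}, y_i)}{\sum_{j=i}^{M+N} \exp r(\mathcal{Q}, y_j)}
   = \frac{1}{1 + \sum_{j=i+1}^{M+N} \exp\bigl(r(\mathcal{Q}, y_j) - r(\mathcal{Q}, y_i)\bigr)} \\
  &= \sigma\Bigl(-\log \sum_{j=i+1}^{M+N} \exp \mathcal{V}_{\beta}\Bigr),
  \qquad \mathcal{V}_{\beta} = r(\mathcal{Q}, y_j) - r(\mathcal{Q}, y_i), \\
\log \prod_{i=1}^{K} f_{\mathcal{E}}(\mathcal{Q}, y_i)
  &= \sum_{i=1}^{K} \log \sigma\Bigl(-\log \sum_{j=i+1}^{M+N} \exp \mathcal{V}_{\beta}\Bigr)
   = \mathcal{H}(\mathcal{Q}, y).
\end{aligned}
```

The second equality uses $\sigma(-z)=1/(1+e^{z})$ with $z=\log\sum_{j}\exp\mathcal{V}_{\beta}$. Under the assumed reward, the $\beta\log Z(\mathcal{Q})$ term cancels in the difference $r(\mathcal{Q},y_j)-r(\mathcal{Q},y_i)$, which leaves the log-ratio form of $\mathcal{V}_{\beta}$ given above.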

### 3.4 Model Training and Inference

Our *AMATA* framework is trained in two stages. First, we perform intra-trajectory preference learning using $\mathcal{L}_{\text{Intra}}(\Theta)$, which lets the model learn how strongly to rely on each agent within a trajectory. Next, we combine the basic agent prediction loss $\mathcal{L}(\mathcal{T})$ and the inter-agent dependency loss $\mathcal{L}_{\text{Inter}}(\tilde{\Theta})$ to form the total loss:

$$
\mathcal{L}_{\text{total}} = \alpha_1\cdot\mathcal{L}(\mathcal{T}) + \alpha_2\cdot\mathcal{L}_{\text{Inter}}(\tilde{\Theta}),
\tag{8}
$$

thereby enhancing multi-agent cooperation in knowledge-intensive QA tasks. Here, $\alpha_1$ and $\alpha_2$ are training coefficients that sum to 1. Due to space constraints, we refer readers to Appendix [C](https://arxiv.org/html/2605.17352#A3) for inference algorithm details.
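Eqs. (6) to (8) can be sketched in plain Python for a small batch. This is an illustration under stated assumptions rather than the authors' training code: the scaled log-ratios $\beta\log\frac{\pi_{\tilde{\Theta}}(y\mid\mathcal{Q})}{\pi_{\mathrm{ref}}(y\mid\mathcal{Q})}$ are assumed to be computed elsewhere and passed in as floats, the expectation in Eq. (6) is approximated by a batch mean, and all function names are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def dependency_h(scaled_log_ratios: Sequence[float], k: int) -> float:
    """H(Q, y) of Eq. (7) for one question.

    scaled_log_ratios[i] is assumed to hold beta * log(pi(y_i|Q) / pi_ref(y_i|Q)),
    ordered with the K ranked winners first, followed by the rejected set.
    """
    n = len(scaled_log_ratios)
    if not 0 < k < n:
        raise ValueError("need 0 < k < number of samples (non-empty rejected set)")
    total = 0.0
    for i in range(k):
        # V_beta for every j > i: score of sample j minus score of sample i.
        v = [scaled_log_ratios[j] - scaled_log_ratios[i] for j in range(i + 1, n)]
        total += log_sigmoid(-log_sum_exp(v))
    return total


def total_loss(trajectory_loss: float, h_values: Sequence[float], alpha1: float) -> float:
    """Eq. (8), with alpha2 = 1 - alpha1 and L_Inter taken as the negative batch mean of H."""
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    inter_loss = -sum(h_values) / len(h_values)
    return alpha1 * trajectory_loss + (1.0 - alpha1) * inter_loss
```

Deriving `alpha2` from `alpha1` inside `total_loss` enforces the constraint that the two coefficients sum to 1; the values of the coefficients themselves are hyperparameters not given in this section.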

## 4 Experiments

We conduct extensive experiments to evaluate *AMATA*. Due to space limitations, details regarding trajectory data collection, baselines, and implementation are provided in Appendix [A](https://arxiv.org/html/2605.17352#A1).

### 4.1 Main Results

As shown in Table [2](https://arxiv.org/html/2605.17352#S3.T2), the key observations are as follows: (1) Compared to standard QA baselines, *AMATA* significantly outperforms models with comparable or even larger parameter sizes. Notably, it incorporates external knowledge through trajectory learning, compensating for the parameter-size gap relative to larger backbones such as Vicuna-v1.5 (13B) and Llama2-13B-Chat, especially on long-tail knowledge tasks (i.e., PopQA, SQuAD 1.1, and ASQA) in comparison to GPT-3.5-turbo. (2) Knowledge-augmented methods leverage retrievers to access external data and assist LLMs in generating informed answers; however, data noise and excessive augmentation can significantly degrade model performance Fang et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib223)). Notably, despite sharing the same training data and backbone as RA-DIT Lin et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib222)), our model achieves a higher fluency score as measured by MAUVE. (3) We also compare against LLM-based trajectory methods, including the independent-agent modular training method MMAgent (comprising six independent agents), the global-local trajectory approach SMART Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)), and recent long-trajectory methods such as GiGPO Feng et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib36)) and SPA-RL Wang et al. ([2025a](https://arxiv.org/html/2605.17352#bib.bib235)). The results indicate that long-trajectory methods generally outperform multi-agent training approaches in our setting. This improvement can be attributed to RL feedback, which mitigates the cumulative propagation of agent-action errors in longer trajectories. Our method further encourages both intra- and inter-agent dependencies within trajectories, thereby reducing collaborative conflicts among agents.

| Model | HealthQA (Acc.) | ARC-C (Acc.) | PopQA (Acc.) | ASQA (Str_EM) |
| --- | --- | --- | --- | --- |
| *Training ablation* | | | | |
| AMATA 7B | 75.83 | 72.47 | 47.39 | 49.10 |
| w/o $\mathcal{L}_{\mathrm{Intra}}$ | 72.91 | 70.05 | 44.32 | 45.28 |
| w/o $\mathcal{L}_{\mathrm{Inter}}$ | 70.86 | 67.57 | 41.13 | 42.94 |
| w/o $\mathcal{L}_{\mathcal{T}}$ | 72.03 | 69.78 | 43.24 | 44.62 |
| w/o $\mathcal{A}_{\text{IR}}$ | 73.38 | 70.50 | 45.41 | 46.95 |
| w/o $\mathcal{A}_{\text{KF}}$ | 72.66 | 69.98 | 44.27 | 45.12 |
| w/o $\mathcal{A}_{\text{AV}}$ | 73.13 | 69.34 | 45.20 | 45.85 |
| *Inference ablation* | | | | |
| w/o $\mathcal{A}_{\text{IR}}$ | 73.57 | 70.83 | 45.75 | 47.26 |
| w/o $\mathcal{A}_{\text{KF}}$ | 72.99 | 70.15 | 44.89 | 45.37 |
| w/o $\mathcal{A}_{\text{AV}}$ | 73.36 | 70.29 | 45.41 | 47.50 |

Table 3: Training and inference ablation of key trajectory learning modules and agents in *AMATA*.

### 4.2 Ablation Study

We perform ablation studies on the critical modules involved in the training and inference processes of *AMATA*. As shown in Table [3](https://arxiv.org/html/2605.17352#S4.T3), during training, removing $\mathcal{L}_{\text{Inter}}$ prevents the modeling of associations between agents (e.g., knowledge agents and verifiers), resulting in unresolved collaborative conflicts and the steepest decline in model performance (e.g., from 75.83 to 70.86 on HealthQA). Additionally, removing the basic trajectory loss $\mathcal{L}(\mathcal{T})$ causes a significant performance drop (from 75.83 to 72.03 on HealthQA), as it impairs the semantic modeling of agent trajectories.

From the perspective of individual agents, removing $\mathcal{A}_{\text{KF}}$ leaves the noise in the data retrieved by $\mathcal{A}_{\text{KR}}$ unfiltered, leading to a marked decline in performance. Removing the verifier agent $\mathcal{A}_{\text{AV}}$ eliminates the verification of generated trajectory answers, which may cause hallucinated outputs and subsequently degrade results. Because question content varies widely, $\mathcal{A}_{\text{IR}}$ is essential for enabling semantic understanding of user queries; thus, its removal also adversely affects overall performance.
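The ablation ranking discussed above can be reproduced from the training-ablation rows of Table 3 with a few lines of arithmetic. The sketch below uses an unweighted mean over the four reported columns as the aggregate; that aggregation is an assumption made here for illustration (the columns mix accuracy and Str_EM), and the names `mean_drop` and `rank_by_drop` are hypothetical.

```python
# Scores copied from Table 3 (training ablation), in column order:
# HealthQA Acc., ARC-C Acc., PopQA Acc., ASQA Str_EM.
FULL_MODEL = [75.83, 72.47, 47.39, 49.10]
TRAINING_ABLATIONS = {
    "w/o L_Intra": [72.91, 70.05, 44.32, 45.28],
    "w/o L_Inter": [70.86, 67.57, 41.13, 42.94],
    "w/o L_T": [72.03, 69.78, 43.24, 44.62],
    "w/o A_IR": [73.38, 70.50, 45.41, 46.95],
    "w/o A_KF": [72.66, 69.98, 44.27, 45.12],
    "w/o A_AV": [73.13, 69.34, 45.20, 45.85],
}


def mean_drop(variant_scores):
    """Unweighted mean drop versus the full model (illustrative aggregate)."""
    drops = [full - variant for full, variant in zip(FULL_MODEL, variant_scores)]
    return sum(drops) / len(drops)


def rank_by_drop(ablations):
    """Variant names sorted from largest to smallest mean drop."""
    return sorted(ablations, key=lambda name: mean_drop(ablations[name]), reverse=True)


if __name__ == "__main__":
    for name in rank_by_drop(TRAINING_ABLATIONS):
        print(f"{name}: {mean_drop(TRAINING_ABLATIONS[name]):.2f}")
```

Under this aggregate, removing $\mathcal{L}_{\text{Inter}}$ gives the largest drop, followed by removing $\mathcal{L}_{\mathcal{T}}$, and $\mathcal{A}_{\text{KF}}$ is the most costly agent to remove during training, in line with the discussion above.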

### 4.3 Detailed Analysis

In this section, we conduct an in-depth analysis of the adaptive cooperation among *AMATA* agents, elucidating its advantages in both performance and efficiency. Due to space limitations, the computational cost comparison, LLM backbones, hyperparameter analysis, and case studies are presented in Appendix [B](https://arxiv.org/html/2605.17352#A2).

![Refer to caption](https://arxiv.org/html/2605.17352v1/x4.png)

Figure 4: Agent dependency analysis across preference methods. “DA-DPO”, “FDPO”, “GL”, and “RL” refer to our DA-DPO, full-order DPO, global-local trajectory, and reinforcement learning, respectively.

Agent Dependency Analysis. We adopt different DPO methods to investigate whether comparing winning and losing examples can enhance dependencies among agents. Specifically, we compare our DA-DPO method with DPO Rafailov et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib220)) and full-order DPO (FDPO) Rafailov et al. ([2023](https://arxiv.org/html/2605.17352#bib.bib220)). The DPO method utilizes only a single pair of winning and losing QA samples, while FDPO leverages the same number of samples as our method but ranks them using full-order learning derived from the magnitude of preference scores. Additionally, we compare our DA-DPO with global-local (SMART) and RL-based (GiGPO) approaches.

In Figure [4](https://arxiv.org/html/2605.17352#S4.F4), our DA-DPO method consistently outperforms the other DPO-based approaches. This improvement is attributed to its fine-grained preference learning for trajectories in multi-agent collaboration, guided by preference scores. Subtle differences in scores for winning examples effectively reflect strong correlations between agents, while the inclusion of losing samples and winning examples with weaker scores helps distinguish less dependent relationships. Methods that do not explicitly account for pairwise preference data or that rely on FDPO-style comprehensive sorting tend to reduce overall model effectiveness. Regarding the global-local trajectory and RL-based approaches, their results exhibit greater consistency compared to DPO and FDPO. We hypothesize that this improvement arises because global-local fusion and RL step-level feedback enhance supervision of fine-grained agent dependencies Du et al. ([2024](https://arxiv.org/html/2605.17352#bib.bib221)).

![Refer to caption](https://arxiv.org/html/2605.17352v1/x5.png)

Figure 5: Average token consumption per question.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x6.png)

Figure 6: Performance of LLM-based trajectory methods relative to trajectory length.

Token Consumption. In Figure [5](https://arxiv.org/html/2605.17352#S4.F5), we compare the number of tokens processed by two base models (Llama2-7B and Qwen2.5-7B) when solving questions across different multi-agent frameworks. We observe that: (1) Although knowledge-augmented methods consume fewer tokens via a single enhancement step, their performance remains modest, as shown in Table [2](https://arxiv.org/html/2605.17352#S3.T2). (2) Although the global-local trajectory baseline SMART Yue et al. ([2025](https://arxiv.org/html/2605.17352#bib.bib242)) involves fewer agent interactions, *AMATA* maintains comparable token overhead while significantly outperforming it (+4.87 points; 60.57 vs. 55.70 in the last column of Table 2). This improvement stems from our adaptive agent preference learning, which enables *AMATA* to dynamically adjust interactions among knowledge agents (i.e., $\mathcal{A}_{\text{KR}}$, $\mathcal{A}_{\text{KF}}$, and $\mathcal{A}_{\text{KL}}$) and other agents, reducing token consumption. (3) By contrast, RL-based methods require multiple rollout sessions to compute advantages, resulting in the highest token consumption (+70%). Moreover, due to the absence of real-world step-level reward signals, these methods introduce noisy data to the knowledge agents, leading to lower performance than *AMATA* (−4.02 points; 56.55 for GiGPO vs. 60.57 in the last column of Table 2).
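The performance gaps quoted in the token-consumption discussion can be checked against the rows of Table 2 that appear in this section. The sketch below assumes that the final column of Table 2 is the unweighted mean of the seven preceding task scores; the column header is not visible here, so this reading is an inference, and the helper names are hypothetical.

```python
# Rows copied from the visible part of Table 2: seven task scores, then the
# value reported in the final column.
TABLE2_ROWS = {
    "MMAgent 7B": ([72.56, 64.43, 37.92, 24.62, 34.13, 37.25, 90.11], 51.57),
    "SMART 7B": ([73.90, 67.31, 42.88, 29.24, 42.56, 41.71, 92.32], 55.70),
    "SPA-RL 7B": ([73.23, 68.53, 42.72, 29.46, 43.73, 41.37, 91.02], 55.72),
    "GiGPO 7B": ([73.97, 68.01, 43.52, 29.97, 43.88, 43.64, 92.83], 56.55),
    "AMATA 7B": ([75.83, 72.47, 47.39, 34.61, 49.10, 48.26, 96.35], 60.57),
}


def recomputed_average(scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


def gap_to_amata(method):
    """Difference between the reported final-column values of AMATA and `method`."""
    amata_reported = TABLE2_ROWS["AMATA 7B"][1]
    return round(amata_reported - TABLE2_ROWS[method][1], 2)


if __name__ == "__main__":
    for name, (scores, reported) in TABLE2_ROWS.items():
        print(name, recomputed_average(scores), reported)
    print("gap to SMART:", gap_to_amata("SMART 7B"))
    print("gap to GiGPO:", gap_to_amata("GiGPO 7B"))
```

The gaps are computed from the reported (already rounded) final-column values, which is how the quoted figures of 4.87 and 4.02 arise; they are differences in score points rather than relative percentages.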

Generalization by Trajectory Length. Figure [6](https://arxiv.org/html/2605.17352#S4.F6) compares the performance of LLM-based trajectory methods as trajectory length increases, i.e., as the number of agents in the multi-agent system grows. Constructing longer trajectories simulates multi-agent environments by introducing more agents to evaluate whether retrieved documents are properly filtered and located. We observe that when the number of agents is relatively small, our adaptive preference learning method and the global-local trajectory method outperform the RL approach. We hypothesize that shorter trajectories facilitate simpler and more direct end-to-end supervised training. By contrast, RL-based methods require the computation of advantage values based on feedback at each step, which can hinder the timely propagation of feedback to local agents. However, as trajectory length increases, the RL-based method gradually improves results due to the cumulative effect of reward feedback and iterative rollout learning. Meanwhile, other methods suffer from excessive error accumulation, resulting in performance degradation. In this paper, knowledge QA tasks generally involve short trajectories (fewer than 10 steps), while GUI tasks typically involve long trajectories (more than 50 steps) Yao et al. ([2022](https://arxiv.org/html/2605.17352#bib.bib230)).

## 5 Conclusion

We propose an adaptive multi-agent trajectory framework, *AMATA*, which enhances LLMs by effectively incorporating external knowledge to solve knowledge-intensive QA tasks. Our key innovations in intra-trajectory and inter-agent preference learning enable prioritization of critical agents and accurate modeling of cross-agent dependencies. Experiments on diverse knowledge-intensive QA benchmarks demonstrate the effectiveness and efficiency of our approach.

## Limitations

Despite the promising results achieved by *AMATA*, our work has several limitations that warrant further investigation. Due to constraints in computational resources, our experiments were conducted primarily on a 7B-parameter model. We anticipate that scaling up to larger models (e.g., 70B parameters or beyond) could further enhance performance, particularly for more complex knowledge-intensive tasks. Additionally, the number of winning and losing samples used in our dependency-aware DPO was limited to $M=N=10$, which may not fully capture the diversity of inter-agent dependencies in more heterogeneous task settings. We set the top-$K$ value to 5 for ranking winning examples, a conservative choice that balances efficiency and effectiveness but may overlook finer-grained preference structures. Future work will explore larger sample sizes and more adaptive ranking strategies as computational capacity increases.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 62506110). It was also supported by the Natural Science Foundation of Anhui Province, China (Grant No. 2508085QF227) and the Hefei University of Technology Scientific Research Innovation Start-up Special Project Type A (Grant No. JZ2025HGQA0137).

## References

- M. Akhtar, O. Cocarascu, and E. Simperl (2022) PubHealthTab: A public health table-based dataset for evidence-based fact checking. In NAACL. [Link](https://aclanthology.org/2022.findings-naacl.1/)
- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-rag: learning to retrieve, generate, and critique through self-reflection. In ICLR. [Link](https://openreview.net/forum?id=hSyW5go0v8)
- R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. ISSN 00063444, 14643510. [Link](http://www.jstor.org/stable/2334029)
- B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao (2023) FireAct: toward language agent fine-tuning. CoRR abs/2310.05915. [Link](https://doi.org/10.48550/arXiv.2310.05915)
- J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K. K. Wong (2024) MapGPT: map-guided prompting with adaptive path planning for vision-and-language navigation. In ACL, pp. 9796–9810. [Link](https://doi.org/10.18653/v1/2024.acl-long.529)
- M. Cheng, Y. Luo, J. Ouyang, Q. Liu, H. Liu, L. Li, S. Yu, B. Zhang, J. Cao, J. Ma, D. Wang, and E. Chen (2025) A survey on knowledge-oriented retrieval-augmented generation. CoRR abs/2503.10677. [Link](https://doi.org/10.48550/arXiv.2503.10677)
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR abs/1803.05457. [Link](http://arxiv.org/abs/1803.05457)
- H. Du, S. Li, M. Wu, X. Feng, Y. Li, and H. Wang (2024) Rewarding what matters: step-by-step reinforcement learning for task-oriented dialogue. In EMNLP, pp. 8030–8046. [Link](https://doi.org/10.18653/v1/2024.findings-emnlp.472)
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Rozière, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. M. Kloumann, I. Misra, I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, et al. (2024) The llama 3 herd of models. CoRR abs/2407.21783. [Link](https://doi.org/10.48550/arXiv.2407.21783)
- W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T. Chua, and Q. Li (2024) A survey on RAG meeting llms: towards retrieval-augmented large language models. In KDD, pp. 6491–6501. [Link](https://doi.org/10.1145/3637528.3671470)
- F. Fang, Y. Bai, S. Ni, M. Yang, X. Chen, and R. Xu (2024) Enhancing noise robustness of retrieval-augmented language models with adaptive adversarial training. In ACL, pp. 10028–10039. [Link](https://aclanthology.org/2024.acl-long.540/)
- L. Feng, Z. Xue, T. Liu, and B. An (2025) Group-in-group policy optimization for LLM agent training. CoRR abs/2505.10978. [Link](https://doi.org/10.48550/arXiv.2505.10978)
- P. Gao, J. Zhao, X. Chen, and Y. Long (2025) An efficient context-dependent memory framework for llm-centric agents. In NAACL, pp. 1055–1069. [Link](https://doi.org/10.18653/v1/2025.naacl-industry.80)
- T. Gao, H. Yen, J. Yu, and D. Chen (2023) Enabling large language models to generate text with citations. In EMNLP, pp. 6465–6488. [Link](https://doi.org/10.18653/v1/2023.emnlp-main.398)
- OpenAI (2023) Gpt-4 technical report. In OpenAI. [Link](https://openai.com/blog/chatgpt)
- X. Hou, M. Yang, W. Jiao, X. Wang, Z. Tu, and W. X. Zhao (2024) CoAct: A global-local hierarchy for autonomous agent collaboration. CoRR abs/2406.13381. [Link](https://doi.org/10.48550/arXiv.2406.13381)
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In ICLR. [Link](https://openreview.net/forum?id=nZeVKeeFYf9)
- Y. Hu, C. Chen, C. H. Yang, R. Li, D. Zhang, Z. Chen, and E. Chng (2024) GenTranslate: large language models are generative multilingual speech and machine translators. In ACL, pp. 74–90. [Link](https://doi.org/10.18653/v1/2024.acl-long.5)
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu (2025) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43 (2), pp. 42:1–42:55. [Link](https://doi.org/10.1145/3703155)
- Y. Ji and S. Gao (2024) Evaluating the effectiveness of large language models in representing and understanding movement trajectories. CoRR abs/2409.00335. [Link](https://doi.org/10.48550/arXiv.2409.00335)
- Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung (2023) Towards mitigating LLM hallucination via self reflection. In EMNLP, pp. 1827–1843. [Link](https://doi.org/10.18653/v1/2023.findings-emnlp.123)
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023) Mistral 7b. CoRR abs/2310.06825. [Link](https://doi.org/10.48550/arXiv.2310.06825)
- V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In EMNLP, pp. 6769–6781. [Link](https://doi.org/10.18653/v1/2020.emnlp-main.550)
- T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal (2023) Decomposed prompting: A modular approach for solving complex tasks. In ICLR. [Link](https://openreview.net/forum?id=%5C_nGgzQjzaRy)
- D. Klisura, A. R. B. Torres, A. K. Gárate-Escamilla, R. R. Biswal, K. Yang, H. Pataci, and A. Rios (2025) A multi-agent framework for mitigating dialect biases in privacy policy question-answering systems. CoRR abs/2506.02998. [Link](https://doi.org/10.48550/arXiv.2506.02998)
- B. Koopman, A. Mourad, H. Li, A. van der Vegt, S. Zhuang, S. Gibson, Y. Dang, D. Lawrence, and G. Zuccon (2024) AgAsk: an agent to help answer farmer’s questions from scientific documents. Int. J. Digit. Libr. 25 (4), pp. 569–584. [Link](https://doi.org/10.1007/s00799-023-00369-y)
- T. Kwon, N. D. Palo, and E. Johns (2024) Language models as zero-shot trajectory generators. IEEE Robotics Autom. Lett. 9 (7), pp. 6728–6735. [Link](https://doi.org/10.1109/LRA.2024.3410155)
- D. Li, J. Yan, T. Zhang, C. Wang, X. He, L. Huang, H. Xue, and J. Huang (2024) On the role of long-tail knowledge in retrieval augmented large language models. In ACL, pp. 120–126. [Link](https://doi.org/10.18653/v1/2024.acl-short.12), [Document](https://dx.doi.org/10.18653/V1/2024.ACL-SHORT.12)
- X. Li (2025) A review of prominent paradigms for llm-based agents: tool use, planning (including rag), and feedback learning. In COLING, pp. 9760–9779. [Link](https://aclanthology.org/2025.coling-main.652/)
- X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis, L. Zettlemoyer, and W. Yih (2024) RA-DIT: retrieval-augmented dual instruction tuning. In ICLR. [Link](https://openreview.net/forum?id=22OTbutug9)
- Y. Long, B. Hui, F. Ye, Y. Li, Z. Han, C. Yuan, Y. Li, and X. Wang (2023) SPRING: situated conversation agent pretrained with multimodal questions from incremental layout graph. In AAAI, pp. 13309–13317. [Link](https://doi.org/10.1609/aaai.v37i11.26562)
- A. Mallen, A. Asai, V. Zhong, R. Das, H. Hajishirzi, and D. Khashabi (2022) When not to trust language models: investigating effectiveness and limitations of parametric and non-parametric memories. CoRR abs/2212.10511. [Link](https://doi.org/10.48550/arXiv.2212.10511)
- S. Meng, Y. Wang, C. Yang, N. Peng, and K. Chang (2024) LLM-a\*: large language model enhanced incremental heuristic search on path planning. In EMNLP, pp. 1087–1102. [Link](https://doi.org/10.18653/v1/2024.findings-emnlp.60)
- R. L. Plackett (1975) The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics) 24 (2), pp. 193–202. ISSN 00359254, 14679876. [Link](http://www.jstor.org/stable/2346567)
- C. Qian, Z. Xie, Y. Wang, W. Liu, Y. Dang, Z. Du, W. Chen, C. Yang, Z. Liu, and M. Sun (2025) Scaling large-language-model-based multi-agent collaboration. In ICLR. [Link](https://openreview.net/forum?id=K3n5jPkrU6)
- S. Qiao, Z. Qiu, B. Ren, X. Wang, X. Ru, N. Zhang, X. Chen, Y. Jiang, P. Xie, F. Huang, and H. Chen (2025) Agentic knowledgeable self-awareness. In ACL, pp. 12601–12625.
- Qwen Team (2024) Qwen2.5: a party of foundation models. [Link](https://qwenlm.github.io/blog/qwen2.5/)
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In NeurIPS. [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html)
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pp. 2383–2392. [Link](https://doi.org/10.18653/v1/d16-1264)
- J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed. In KDD. [Link](http://dx.doi.org/10.1145/3394486.3406703), [Document](https://dx.doi.org/10.1145/3394486.3406703)
- O. Rubin, J. Herzig, and J. Berant (2022) Learning to retrieve prompts for in-context learning. In NAACL, pp. 2655–2671. [Link](https://doi.org/10.18653/v1/2022.naacl-main.191)
- W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih (2024) REPLUG: retrieval-augmented black-box language models. In NAACL, pp. 8371–8384. [Link](https://doi.org/10.18653/v1/2024.naacl-long.463)
- A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei (2025) Agentic retrieval-augmented generation: A survey on agentic RAG. CoRR abs/2501.09136. [Link](https://doi.org/10.48550/arXiv.2501.09136)
- Y. Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y. Lin (2024) Trial and error: exploration-based trajectory optimization for LLM agents. CoRR abs/2403.02502. [Link](https://doi.org/10.48550/arXiv.2403.02502)
- A. Subramonian, X. Yuan, H. D. III, and S. L. Blodgett (2023) It takes two to tango: navigating conceptualizations of NLP tasks and measurements of performance. In ACL, pp. 3234–3279. [Link](https://doi.org/10.18653/v1/2023.findings-acl.202)
- C. Sun, S. Huang, and D. Pompili (2024) LLM-based multi-agent reinforcement learning: current and future directions. CoRR abs/2405.11106. [Link](https://doi.org/10.48550/arXiv.2405.11106)
- X. Tang, M. Kan, S. Shan, and X. Chen (2025) Plan-r1: safe and feasible trajectory planning as language modeling. CoRR abs/2505.17659. [Link](https://doi.org/10.48550/arXiv.2505.17659)
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample (2023) LLaMA: open and efficient foundation language models. CoRR abs/2302.13971. [Link](https://doi.org/10.48550/arXiv.2302.13971), 2302.13971
- H. Wang, C. T. Leong, J. Wang, J. Wang, and W. Li (2025a) SPA-RL: reinforcing LLM agents via stepwise progress attribution. CoRR abs/2505.20732. [Link](https://doi.org/10.48550/arXiv.2505.20732)
- Y. Wang, R. Ren, Y. Wang, W. X. Zhao, J. Liu, H. Wu, and H. Wang (2025b) Unveiling knowledge utilization mechanisms in llm-based retrieval-augmented generation. CoRR abs/2505.11995. [Link](https://doi.org/10.48550/arXiv.2505.11995)
- S. Woo, K. Zhou, Y. Zhou, S. Wang, S. Guan, H. Ding, and L. L. Cheong (2025) Black-box visual prompt engineering for mitigating object hallucination in large vision language models. In NAACL, pp. 529–538. [Link](https://doi.org/10.18653/v1/2025.naacl-short.45)
- W. Xiong, Y. Song, X. Zhao, W. Wu, X. Wang, K. Wang, C. Li, W. Peng, and S. Li (2024) Watch every step! LLM agent learning via iterative step-level process refinement. In EMNLP, pp. 1556–1572. [Link](https://doi.org/10.18653/v1/2024.emnlp-main.93)
- H. Xu, X. Mao, P. Yang, F. Sun, and H. Huang (2024a) Rethinking task-oriented dialogue systems: from complex modularity to zero-shot autonomous agent. In ACL, pp. 2748–2763. [Link](https://doi.org/10.18653/v1/2024.acl-long.152)
- S. Xu, L. Pang, M. Yu, F. Meng, H. Shen, X. Cheng, and J. Zhou (2024b) Unsupervised information refinement training of large language models for retrieval-augmented generation. In ACL, pp. 133–145. [Link](https://doi.org/10.18653/v1/2024.acl-long.9)
- C. Yang, C. Shi, S. Li, B. Shui, Y. Yang, and W. Lam (2025) LLM2: let large language models harness system 2 reasoning. In NAACL, pp. 168–177. [Link](https://doi.org/10.18653/v1/2025.naacl-short.15)
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) WebShop: towards scalable real-world web interaction with grounded language agents. In NeurIPS. [Link](http://papers.nips.cc/paper%5C_files/paper/2022/hash/82ad13ec01f9fe44c01cb91814fd7b8c-Abstract-Conference.html)
- F. Ye, S. Li, Y. Zhang, and L. Chen (2024) R2ag: incorporating retrieval information into retrieval augmented generation. In EMNLP, pp. 11584–11596. [Link](https://doi.org/10.18653/v1/2024.findings-emnlp.678)
- S. Yue, S. Wang, W. Chen, X. Huang, and Z. Wei (2025) Synergistic multi-agent framework with trajectory learning for knowledge-intensive tasks. In AAAI, pp. 25796–25804. [Link](https://doi.org/10.1609/aaai.v39i24.34772)
- Z\. Yue, H\. Zeng, L\. Shang, Y\. Liu, Y\. Zhang, and D\. Wang \(2024\)Retrieval augmented fact verification by synthesizing contrastive arguments\.InACL,pp\. 10331–10343\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.556)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p3.1)\.
- A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang \(2024\)AgentTuning: enabling generalized agent abilities for llms\.InACL,pp\. 3053–3077\.External Links:[Link](https://doi.org/10.18653/v1/2024.findings-acl.181)Cited by:[§2](https://arxiv.org/html/2605.17352#S2.p1.1)\.
- G\. Zhang, Y\. Yue, Z\. Li, S\. Yun, G\. Wan, K\. Wang, D\. Cheng, J\. X\. Yu, and T\. Chen \(2025a\)Cut the crap: an economical communication pipeline for llm\-based multi\-agent systems\.InICLR,External Links:[Link](https://openreview.net/forum?id=LkzuPorQ5L)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p3.1)\.
- S\. Zhang, M\. Yin, J\. Zhang, J\. Liu, Z\. Han, J\. Zhang, B\. Li, C\. Wang, H\. Wang, Y\. Chen, and Q\. Wu \(2025b\)Which agent causes task failures and when? on automated failure attribution of LLM multi\-agent systems\.CoRRabs/2505\.00212\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.00212)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p3.1),[§2](https://arxiv.org/html/2605.17352#S2.p1.1)\.
- T\. Zhang, D\. Li, Q\. Chen, C\. Wang, L\. Huang, H\. Xue, X\. He, and J\. Huang \(2024\)R4\{\}^\{\\mbox\{4\}\}: reinforced retriever\-reorder\-responder for retrieval\-augmented large language models\.In27th European Conference on Artificial Intelligence,Frontiers in Artificial Intelligence and Applications,pp\. 2314–2321\.External Links:[Link](https://doi.org/10.3233/FAIA240755),[Document](https://dx.doi.org/10.3233/FAIA240755)Cited by:[§2](https://arxiv.org/html/2605.17352#S2.p2.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InNeurIPS,External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets%5C_and%5C_Benchmarks.html)Cited by:[2nd item](https://arxiv.org/html/2605.17352#A1.I3.i2.p1.1)\.
- K\. Zhu, X\. Feng, X\. Du, Y\. Gu, W\. Yu, H\. Wang, Q\. Chen, Z\. Chu, J\. Chen, and B\. Qin \(2024\)An information bottleneck perspective for effective noise filtering on retrieval\-augmented generation\.InACL,pp\. 1044–1069\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.59)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p2.1)\.
- C\. Zong, Y\. Yan, W\. Lu, J\. Shao, Y\. Huang, H\. Chang, and Y\. Zhuang \(2024\)Triad: A framework leveraging a multi\-role llm\-based agent to solve knowledge base question answering\.InEMNLP,pp\. 1698–1710\.External Links:[Link](https://doi.org/10.18653/v1/2024.emnlp-main.101)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p3.1),[§2](https://arxiv.org/html/2605.17352#S2.p1.1)\.
- L\. Zou, Q\. Wang, H\. Zhao, J\. Jiangangkong, Y\. Yang, and Y\. Deng \(2024\)CQIL: inference latency optimization with concurrent computation of quasi\-independent layers\.InACL,pp\. 7293–7307\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.394)Cited by:[§1](https://arxiv.org/html/2605.17352#S1.p2.1)\.

## Appendix A Detailed Experimental Settings

### A\.1 Datasets and Evaluation Metrics

#### A\.1\.1 Trajectory Training Set Construction

Our trajectory data are collected from the open\-source long\-trajectory datasets provided by Self\-RAG Asai et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib248)\) and SMART Yue et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib242)\), which together include 140,000 well\-designed instances\. \(The open\-source trajectory data are available at [https://huggingface\.co/datasets/ShengbinYue/Long\-short\-Trajectory](https://huggingface.co/datasets/ShengbinYue/Long-short-Trajectory)\.\) Trajectory data for the two additional agents are derived from these basic trajectories\. Figure [8](https://arxiv.org/html/2605.17352#A1.F8) illustrates the collection process for ⟨Filter⟩ trajectory data, which is guided by the ⟨Retriever⟩ step and subject to the subsequent ⟨Locator⟩ constraints\. Data for the ⟨Verifier⟩ trajectory are collected after the ⟨Generator⟩ step and are used to verify answer robustness, as shown in Figure [9](https://arxiv.org/html/2605.17352#A1.F9)\.

| Agent Roles | High Score \(4–5\) Condition | Low Score \(0–1\) Condition |
| --- | --- | --- |
| Retriever/Filter/Locator | Question requires external, long\-tail, or specific factual knowledge\. | Question can be answered with the LLM’s parametric knowledge alone\. |
| Verifier | The answer is complex, potentially ambiguous, or requires high factual precision\. | The answer is straightforward or the confidence from previous steps is very high\. |
| Reconstructor | The user instruction is complex, multi\-hop, or requires semantic parsing\. | The question is already simple and well\-structured\. |
| Generator | Always required \(baseline score of 5\)\. | N/A |

Table 4: Detailed rubric for different agent roles in preference scoring\.

| Group | Knowledge Agent Scores | PopQA Accuracy |
| --- | --- | --- |
| Group 1 | 1\.2 | 22\.5% |
| Group 2 | 3\.1 | 41\.8% |
| Group 3 | 4\.2 | 63\.4% |

Table 5: Correlation analysis of the reasonableness of LLM\-generated preferences\. “Group 1”, “Group 2”, and “Group 3” correspond to low, medium, and high knowledge requirements, respectively\.

#### A\.1\.2 Trajectory Scores

As shown in Figure [10](https://arxiv.org/html/2605.17352#A1.F10), intra\-trajectory scores are computed based on QA pairs and two demonstration examples\. Inter\-trajectory score data are annotated using demonstration examples with varying task instructions, as depicted in Figure [11](https://arxiv.org/html/2605.17352#A1.F11)\. A complete training example is provided in Figure [12](https://arxiv.org/html/2605.17352#A1.F12)\. Inter\-dependency preference data are sorted by the sum of scores across trajectories\.

There are two primary types of agents involved in scoring: knowledge agents, and generator/verifier agents\. Knowledge agents are essential when multi\-agent systems require external knowledge to solve complex questions\. Generator/verifier agents are associated with the confidence level of LLMs during answer generation\.

To clarify the rationale behind our scoring process, we provide a detailed rubric for agent preference scoring in Table [4](https://arxiv.org/html/2605.17352#A1.T4)\. For example, for the question “Which American actor played fraternity president ‘Lewis Skol\.’ in the ‘Revenge of the Nerds’ comedy films?” \(see Figure [3](https://arxiv.org/html/2605.17352#S3.F3)\), the LLM annotator correctly assigns high scores to the knowledge agents \(Retriever = 5, Filter = 4, Locator = 4\) because it identifies the need for specific external knowledge, and assigns a low score to the Verifier because confidence in the retrieved evidence is high\.

We verify the reasonableness of these scores through two approaches: ablation studies and correlation analysis\.

- Ablation Study as Direct Evidence: The most direct validation is to observe how performance changes when an agent that the scores deem important is removed\. Our ablation study \(Table [3](https://arxiv.org/html/2605.17352#S4.T3)\) provides strong evidence that the LLM\-assigned preferences align with the agents’ actual functional importance\. For example, on PopQA \(a long\-tail knowledge dataset\), removing $\mathcal{A}_{\text{KR}}$ \(typically high\-scored\) results in a performance drop of approximately 3\.1%, the largest among single\-agent ablations\. This confirms that the LLM annotator correctly identifies the critical need for retrieval in such tasks\. Conversely, removing $\mathcal{A}_{\text{AV}}$ has a smaller but still notable impact, particularly on factuality\-focused tasks such as ARC\-C and ASQA, justifying its medium\-to\-low but non\-zero scores\.
- Correlation Analysis: To further support our claim, we conduct a correlation analysis on a sample of 200 questions from the PopQA dataset, comparing the average preference score assigned to the three knowledge agents \(Retriever, Filter, Locator\) against final task accuracy\. As shown in Table [5](https://arxiv.org/html/2605.17352#A1.T5), accuracy rises from 22\.5% to 63\.4% as the average knowledge\-agent score rises from 1\.2 to 4\.2\. This strong positive relationship indicates that higher LLM\-assigned preference scores for knowledge agents correspond to substantially higher accuracy, quantitatively supporting the reasonableness of the scores\.
- DPO Score Examples: To help the LLM understand our scoring process, we provide two examples as demonstrations in the API input prompts, shown in Figure [7](https://arxiv.org/html/2605.17352#A1.F7)\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x7.png)

Figure 7: Examples of different preference scores for ⟨Retriever⟩\.

In summary, while annotation is performed by an LLM, it is grounded in a structured, semantically meaningful rubric\. More importantly, its effectiveness is empirically confirmed through ablation studies and correlation analysis, demonstrating that the learned preferences contribute directly to the framework’s performance and efficiency\.
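
The relationship summarized in Table 5 can be checked with a few lines of arithmetic\. The following is a minimal sketch that computes a Pearson coefficient over the three group\-level values reported in the table\. The paper does not report a coefficient, and the helper name `pearson_r` is introduced here only for illustration; with three aggregated points the result is indicative rather than a statistical test\.

```python
import math

# Group-level values copied from Table 5 (low, medium, high knowledge requirement).
knowledge_agent_scores = [1.2, 3.1, 4.2]
popqa_accuracy = [22.5, 41.8, 63.4]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical helper applied to the three reported groups only.
r = pearson_r(knowledge_agent_scores, popqa_accuracy)
print(f"Pearson r over the three groups: {r:.3f}")
```

A coefficient close to 1 on these aggregated values is consistent with the positive trend described above, although per\-question data would be needed for a proper significance estimate\.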

![Refer to caption](https://arxiv.org/html/2605.17352v1/x8.png)

Figure 8: Collection of “Filter” trajectory data\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x9.png)

Figure 9: Collection of “Verifier” trajectory data\.

#### A\.1\.3 Evaluation Datasets and Metrics

- Fact Verification: PubHealth \(also referred to as HealthQA\) Akhtar et al\. \([2022](https://arxiv.org/html/2605.17352#bib.bib227)\) is a public health fact\-checking dataset\. Model performance is evaluated by accuracy \(Acc\.\) on its test set of 987 samples labeled “True” or “False”\.
- Multiple\-choice QA: ARC\-Challenge Clark et al\. \([2018](https://arxiv.org/html/2605.17352#bib.bib226)\) consists of 1,172 multiple\-choice science exam questions\. Performance is measured by accuracy \(Acc\.\)\.
- Open\-domain QA: \(1\) PopQA Mallen et al\. \([2022](https://arxiv.org/html/2605.17352#bib.bib225)\) contains 1,399 long\-tail, rare\-entity queries from Wikipedia\. \(2\) SQuAD Rajpurkar et al\. \([2016](https://arxiv.org/html/2605.17352#bib.bib1)\) includes 8,886 queries written by annotators based on documents\. Following prior work Asai et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib248)\), performance is evaluated using exact match \(EM\)\.
- Ambiguous QA: ASQA Gao et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib224)\) features 4,132 ambiguous factual questions requiring long\-form responses\. Fluency is assessed using Mauve, and accuracy is measured with Str\_EM and Rouge\-L, consistent with official evaluation settings\.
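
For readers unfamiliar with the two simplest metrics above, the sketch below shows one common way to compute accuracy and exact match\. It is an illustration under assumptions, not the evaluation script used in the paper: the normalization \(lowercasing, stripping punctuation and English articles\) follows the widely used SQuAD\-style convention, and the function names are hypothetical\. The paper states only that it follows Asai et al\. \(2024\) for these settings\.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper's exact rules may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of label predictions that match the reference exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))
print(accuracy(["True", "False"], ["True", "True"]))
```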

![Refer to caption](https://arxiv.org/html/2605.17352v1/x10.png)

Figure 10: Intra\-trajectory score collection\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x11.png)

Figure 11: Inter\-trajectory score collection\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x12.png)

Figure 12: Complete response example from PubHealth\.

### A\.2 Baselines

#### A\.2\.1 Vanilla QA Methods

LLMs acquire extensive factual knowledge, internalized within their model parameters through large\-scale unsupervised pre\-training\. During both training and inference, we adhere to official prompt formats\.

- SFT and preference alignment models: GPT\-3\.5\-turbo \(ChatGPT; we use the gpt\-3\.5\-turbo\-0125 version in experiments\), Llama\-2\-Chat\-7B Touvron et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib153)\), and Llama\-2\-Chat\-13B Touvron et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib153)\)\.
- •

#### A\.2\.2 Knowledge\-Augmented Methods

We implement standard knowledge augmentation approaches\. When model weights are unavailable, methods are replicated using the same base models and training data\. Uniform retrieval models and knowledge bases ensure experimental fairness\.

- REPLUG Shi et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib178)\) uses frozen LLM parameters and augments them with a tunable retrieval model\. In our experiments, the backbone is replaced with Llama\-2\-Chat\-7B for fairness\.
- VANILLA\-7B Gao et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib224)\) first retrieves relevant passages, then instructs the model to assess document relevance and generate appropriate citations\. The backbone is Llama\-2\-Chat\-7B\.
- INTERACT\-7B Gao et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib224)\) employs an interactive prompting mechanism that enables the agent to verify retrieved passages through three distinct actions: “Check”, “Output”, and “End”\. The backbone is Llama\-2\-Chat\-7B\.
- RADIT\-7B Lin et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib222)\) introduces retrieval\-augmented dual instruction tuning, a lightweight fine\-tuning framework that retrofits existing LLMs with retrieval capabilities, offering an alternative to conventional methodologies\. For fair comparison, the pre\-trained Llama\-2 model is fine\-tuned on the same dataset used in our experiments\.
- SelfRAG\-7B Asai et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib248)\) enhances language model quality and factuality through retrieval and self\-reflection with special tokens\.

#### A\.2\.3 LLM\-Based Trajectory Methods

Through multi\-agent collaboration, LLMs with distinct task capabilities can be coordinated to form workflows that enhance response reliability\.

- MMAgent\-3\*7B: Our modular multi\-agent framework, in which each component agent is independently trained on identical datasets while sharing the same pre\-trained Llama\-2 backbone\. The workflow is realized through systematic agent decoupling\.
- SMART Yue et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib242)\): A global\-local multi\-agent framework with predefined trajectories\.
- GiGPO Feng et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib36)\): Proposes a hierarchical architecture that evaluates both global trajectory quality and local action effectiveness, while eliminating the need for auxiliary models or additional rollouts\. This paradigm ensures superior scalability for long\-horizon LLM agent training\.
- SPA\-RL Wang et al\. \([2025a](https://arxiv.org/html/2605.17352#bib.bib235)\): Proposes a general reward redistribution framework that systematically decomposes the final reward into stepwise contributions, with each component accurately reflecting its incremental impact on overall task completion\.

### A\.3 Experimental Settings

#### A\.3\.1 Training Details

Our implementation is initialized with the pre\-trained Llama\-2\-7B foundation model Touvron et al\. \([2023](https://arxiv.org/html/2605.17352#bib.bib153)\)\. Each agent is initialized as an independent large model, emulating distinct agent abilities through special tokens in both intra\- and inter\-trajectory training\. These abilities are acquired by training a shared model Asai et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib248)\); Yue et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib242)\); Qiao et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib208)\)\. Training is performed on two NVIDIA A100 GPUs \(80GB each\), using LoRA adaptation Hu et al\. \([2022](https://arxiv.org/html/2605.17352#bib.bib22)\) for intra\- and inter\-trajectory learning\. Both trajectory learning phases span three epochs with uniform hyperparameters: a batch size of 64, a peak learning rate of $6\times 10^{-4}$, and 5% warmup steps\. The maximum sequence length is 1024 tokens for intra\-trajectory training and 2048 tokens for inter\-trajectory training, with DeepSpeed Stage\-3 Rasley et al\. \([2020](https://arxiv.org/html/2605.17352#bib.bib21)\) applied to optimize GPU memory utilization\.
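
The hyperparameters stated above can be collected in a small configuration object, which also makes the 5% warmup concrete\. This is a minimal sketch: the class and field names are hypothetical, only values stated in this section are filled in \(LoRA rank, optimizer, and scheduler are not given here and are therefore omitted\), and treating 64 as the global batch size is an assumption\.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryTrainingConfig:
    """Values reported in A.3.1; all field names are illustrative."""
    base_model: str = "Llama-2-7B"
    num_gpus: int = 2                  # two NVIDIA A100 (80GB each)
    epochs: int = 3
    batch_size: int = 64               # assumed to be the global batch size
    peak_learning_rate: float = 6e-4
    warmup_ratio: float = 0.05
    max_seq_len: int = 1024

    def warmup_steps(self, num_examples: int) -> int:
        """Number of warmup steps for a dataset of `num_examples` samples."""
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        total_steps = steps_per_epoch * self.epochs
        return round(total_steps * self.warmup_ratio)


# The two phases differ only in maximum sequence length.
INTRA_TRAJECTORY = TrajectoryTrainingConfig(max_seq_len=1024)
INTER_TRAJECTORY = TrajectoryTrainingConfig(max_seq_len=2048)

# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
print(INTER_TRAJECTORY.warmup_steps(num_examples=6400))
```

With the hypothetical 6,400 examples, one epoch is 100 steps, three epochs are 300 steps, and 5% warmup corresponds to 15 steps\.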

Because they benefit from supervision at intermediate agent training steps, RL\-based trajectory methods such as GiGPO Feng et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib36)\) and SPA\-RL Wang et al\. \([2025a](https://arxiv.org/html/2605.17352#bib.bib235)\) outperform simple multi\-agent baselines\. Specifically, we assign the intermediate steps in the GiGPO trajectory to the six multi\-agent actions we define\. The final reward signal is set to $1$ for correctly predicted answers and to $-1$ for incorrect answers\. GiGPO obtains intermediate reward signals by implicitly learning episode\-relative advantages and thus does not require manual specification in the task dataset\. SPA\-RL, however, requires explicit process supervision signals at intermediate steps Wang et al\. \([2025a](https://arxiv.org/html/2605.17352#bib.bib235)\)\. Here, we use GPT\-4o [15](https://arxiv.org/html/2605.17352#bib.bib121) to score each step, approximating intermediate human feedback as accurately as possible\. The final reward setting matches that used for GiGPO\.

#### A\.3\.2 Evaluation Details

For the two additional agents, the knowledge filter \(⟨Filter⟩\) and the verifier \(⟨Verifier⟩\), we apply the following rules\. If ⟨Filter⟩ outputs retrieved\-document indices that are not present in the ⟨Retriever⟩ output, we remove them\. If all indices are filtered out, we retain the first one\. If the output trajectory includes ⟨Verifier⟩, the answer requires further verification\. If the ⟨Verifier⟩ output is “wrong”, we incorporate the error signal into the instruction via the prompt and re\-execute the trajectory for LLM reflection\. Otherwise, we extract the answer from ⟨Generator⟩\. Other settings, including evaluation task instructions, follow those of the SMART model Yue et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib242)\)\.
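
The index\-sanitization rule for ⟨Filter⟩ can be sketched as follows\. This is an illustrative reading of the rule, not the authors’ code: the function name is hypothetical, and the sketch assumes that `retrieved_indices` is ordered by retriever rank, so that “the first one” is the top\-ranked retrieved document\.

```python
from typing import List


def sanitize_filter_indices(
    filter_indices: List[int], retrieved_indices: List[int]
) -> List[int]:
    """Drop filter outputs the retriever never returned.

    If nothing survives, fall back to the first retrieved index
    (assumed here to be the top-ranked document).
    """
    allowed = set(retrieved_indices)
    kept = [idx for idx in filter_indices if idx in allowed]
    if not kept and retrieved_indices:
        kept = [retrieved_indices[0]]
    return kept


# Index 7 was hallucinated by the filter and is removed.
print(sanitize_filter_indices([2, 7, 3], [1, 2, 3]))
# Every filter index is invalid, so the first retrieved index is kept.
print(sanitize_filter_indices([9], [4, 5]))
```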

### A\.4 Agent Description

We extend the SMART Yue et al\. \([2025](https://arxiv.org/html/2605.17352#bib.bib242)\) framework by adding two essential agents to further enhance the reliability of LLM responses: the Knowledge Filter and the Answer Verifier\.

- Intent Reconstructor: Agent $\mathcal{A}_{\text{IR}}$ clarifies the intent of the user question\. To process diverse instructions into well\-formatted intents, the agent employs four key capabilities: \(1\) integrating contextual clues, \(2\) identifying key queries, \(3\) unifying task formulation, and \(4\) decomposing intent\.
- Knowledge Retriever: Given the reconstructed question, agent $\mathcal{A}_{\text{KR}}$ retrieves supplementary knowledge from an external knowledge base \(e\.g\., Wikipedia\)\. For simple questions, *AMATA* may skip the intermediate knowledge agents \(i\.e\., $\mathcal{A}_{\text{KR}}$, $\mathcal{A}_{\text{KF}}$, and $\mathcal{A}_{\text{KL}}$\) and directly invoke the response generator $\mathcal{A}_{\text{RG}}$\.
- Knowledge Filter: To eliminate redundant information from retrieved documents, agent $\mathcal{A}_{\text{KF}}$ extracts the most accurate background knowledge\. In doing so, $\mathcal{A}_{\text{KF}}$ enables the locator agent $\mathcal{A}_{\text{KL}}$ to constrain its search scope and perform fine\-grained identification of knowledge that supports the generator $\mathcal{A}_{\text{RG}}$\.
- Knowledge Locator: Operating on the refined document set from $\mathcal{A}_{\text{KF}}$, agent $\mathcal{A}_{\text{KL}}$ performs granular localization to identify and extract the knowledge segments most conducive to response generation by $\mathcal{A}_{\text{RG}}$\.
- Response Generator: Agent $\mathcal{A}_{\text{RG}}$ synthesizes responses in two modes: when knowledge segments are supplied by $\mathcal{A}_{\text{KL}}$, it generates answers constrained by localized evidence with explicit source attribution; otherwise, it relies solely on parametric knowledge\. The generator maintains provenance transparency by clearly differentiating evidence\-based and intrinsic knowledge sources\.
- Answer Verifier: Agent $\mathcal{A}_{\text{AV}}$ performs self\-correction by re\-examining the generated response against relevant knowledge sources, identifying inaccuracies through evidence\-based verification, and applying targeted revisions to enhance factual consistency and logical coherence\.

We further detail the steps for the relevant agents \(i\.e\., $\mathcal{A}_{\text{IR}}$, $\mathcal{A}_{\text{KF}}$, and $\mathcal{A}_{\text{AV}}$\) to clarify how the agent\-specific trajectory data are constructed\. For the data collection steps of the other agents, refer to Self\-RAG Asai et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib248)\)\.

- Intent Reconstructor: Within multi\-turn dialogues, this agent models dependencies to capture long\-term intent\. When processing noisy instructions, it eliminates extraneous content to isolate essential questions\. For diverse task formats \(e\.g\., multiple\-choice QA\), the agent standardizes inputs into a cohesive query representation\. For complex multi\-hop questions \(e\.g\., “Who was born earlier, person A or person B?”\), it decomposes them into atomic intents, such as retrieving each individual’s birthdate\. By flexibly applying these capabilities, the agent derives a well\-structured query intent, facilitating external knowledge retrieval\.
- Knowledge Filter: Prompt\-induced hallucinations may cause the filter $\mathcal{A}_{\text{KF}}$ to generate documents and indices not present in the output of the retriever $\mathcal{A}_{\text{KR}}$\. To address this, we implement rigorous content verification procedures to ensure filtered content remains consistent with original sources\. Additionally, if filtering removes all documents, the top\-ranked document based on the retriever’s scores is retained\.
- Answer Verifier: For responses marked “correct” by agent $\mathcal{A}_{\text{AV}}$, the reasoning process for that question is terminated\. Conversely, for responses marked “wrong”, error signals are incorporated into the instruction $\mathcal{Q}$ via concatenation\. This encourages the multi\-agent trajectory to adopt a reasoning mode that considers potential errors, prompting deeper reflection and improving the model’s reasoning accuracy\.
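
The agent roles above can be tied together in a short control\-flow sketch\. It is a simplified illustration under stated assumptions, not the paper’s inference algorithm: every callable is a hypothetical stand\-in for a trained agent, the verifier is invoked on every round \(whereas in *AMATA* it runs only when the trajectory includes ⟨Verifier⟩\), the retry bound `max_rounds` is not specified in the paper, and the feedback format appended to the instruction is invented for the example\.

```python
from typing import Callable, List, Tuple


def answer_with_verification(
    question: str,
    reconstruct: Callable[[str], str],                         # Intent Reconstructor
    needs_knowledge: Callable[[str], bool],                    # routing decision
    retrieve: Callable[[str], List[str]],                      # Knowledge Retriever
    filter_docs: Callable[[str, List[str]], List[str]],        # Knowledge Filter
    locate: Callable[[str, List[str]], List[str]],             # Knowledge Locator
    generate: Callable[[str, List[str]], str],                 # Response Generator
    verify: Callable[[str, str, List[str]], Tuple[bool, str]], # Answer Verifier
    max_rounds: int = 2,                                       # assumed retry bound
) -> str:
    instruction = question
    answer = ""
    for _ in range(max_rounds):
        query = reconstruct(instruction)
        evidence: List[str] = []
        if needs_knowledge(query):
            docs = retrieve(query)
            # If filtering removes everything, keep the top-ranked document.
            kept = filter_docs(query, docs) or docs[:1]
            evidence = locate(query, kept)
        answer = generate(query, evidence)
        is_correct, error_signal = verify(query, answer, evidence)
        if is_correct:
            break
        # Concatenate the verifier's error signal onto the original instruction.
        instruction = f"{question}\n[Verifier feedback] {error_signal}"
    return answer
```

Keeping the agents as injected callables makes the routing logic testable with simple stubs, independent of any particular model\.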

## Appendix B Additional Experimental Results

![Refer to caption](https://arxiv.org/html/2605.17352v1/x13.png)

Figure 13: Comparison of training data size and computational cost for various baselines and our *AMATA* model\.

### B\.1 Data Size and Computational Cost Comparison

In Figure [13](https://arxiv.org/html/2605.17352#A2.F13), we present a detailed comparison of the training data size and computational cost for *AMATA* and its key baselines\. All experiments are performed using identical hardware settings \(2× NVIDIA A100 80GB GPUs\) to ensure fairness\.

We observe the following: \(1\) Data Efficiency: The total data consumption of *AMATA* \(approximately 65,000 samples\) is substantially lower than that of SMART \(approximately 500,000 samples\) and is comparable to that of SelfRAG\. Most importantly, the proposed DA\-DPO stage is highly data\-efficient, requiring only about 5,000 expert\-ranked samples to effectively learn complex inter\-agent dependencies\. This is a fraction of the data used by other training stages and baselines\.

\(2\) Computational and Training Time Efficiency: \[1\] Compared to SMART, *AMATA* requires less than half the training time \(about 20 hours vs\. about 45 hours\), mainly because pre\-training on a massive, separate short\-trajectory dataset is unnecessary\. Our intra\-trajectory learning consolidates this into a single, more efficient stage\. \[2\] Compared to RL methods \(GiGPO\), *AMATA* is 3–4 times faster\. RL training is notoriously slow due to multiple rollouts and per\-step reward computation, whereas our DA\-DPO stage performs stable, offline optimization\. \[3\] The additional cost of DA\-DPO over standard DPO is minimal \(about 5 extra hours\), as it uses the same computational framework but employs a more sophisticated loss function\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x14.png)

Figure 14: Averaged performance of *AMATA* corresponding to different loss coefficients $\alpha_{1}$ and $\alpha_{2}$\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x15.png)

Figure 15: Impact of selecting the $K$ value for winning and losing samples on model performance\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x16.png)

Figure 16: Complete response example from MMAgent\.

| Model | HealthQA \(Acc\.\) | ARC\-C \(Acc\.\) | PopQA \(Acc\.\) | Squad1 \(Acc\.\) | ASQA \(Str\_EM\) | ASQA \(Rouge\-L\) | ASQA \(Mauve\) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama\-3\-Ins\. \(8B\) | 64\.39 \(±0\.9\) | 53\.44 \(±1\.5\) | 22\.73 \(±1\.2\) | 16\.24 \(±0\.8\) | 18\.85 \(±1\.7\) | 33\.48 \(±1\.5\) | 37\.65 \(±0\.7\) | 35\.25 \(±1\.3\) |
| GPT4o | 79\.24 \(±1\.0\) | 80\.71 \(±1\.8\) | 30\.94 \(±1\.1\) | 24\.83 \(±1\.5\) | 44\.02 \(±0\.9\) | 36\.92 \(±1\.7\) | 47\.53 \(±1\.6\) | 49\.17 \(±1\.1\) |
| RADIT \(8B\) | 55\.29 \(±1\.1\) | 64\.88 \(±1\.0\) | 41\.15 \(±1\.6\) | 24\.97 \(±1\.1\) | 29\.01 \(±1\.8\) | 17\.22 \(±1\.2\) | 13\.86 \(±0\.8\) | 35\.20 \(±1\.7\) |
| SelfRag \(8B\) | 70\.89 \(±1\.6\) | 68\.14 \(±1\.2\) | 43\.55 \(±1\.5\) | 26\.06 \(±0\.8\) | 30\.95 \(±1\.1\) | 35\.98 \(±1\.2\) | 87\.12 \(±2\.1\) | 51\.81 \(±1\.2\) |
| SMART \(8B\) | 75\.99 \(±1\.6\) | 72\.81 \(±0\.6\) | 47\.66 \(±1\.2\) | 33\.05 \(±1\.4\) | 45\.74 \(±1\.6\) | 44\.95 \(±1\.2\) | 94\.80 \(±1\.2\) | 59\.29 \(±0\.8\) |
| SPA\-RL \(8B\) | 76\.88 \(±1\.7\) | 72\.98 \(±1\.3\) | 46\.85 \(±1\.3\) | 34\.51 \(±1\.8\) | 46\.63 \(±1\.2\) | 45\.57 \(±0\.7\) | 94\.93 \(±1\.1\) | 59\.76 \(±1\.1\) |
| GiGPO \(8B\) | 77\.20 \(±1\.0\) | 73\.84 \(±0\.7\) | 47\.29 \(±1\.7\) | 34\.05 \(±1\.2\) | 45\.75 \(±1\.8\) | 46\.98 \(±1\.0\) | 95\.62 \(±1\.7\) | 60\.10 \(±1\.6\) |
| AMATA \(8B\) | 78\.74 \(±1\.2\) | 74\.11 \(±1\.0\) | 49\.62 \(±1\.6\) | 37\.80 \(±1\.2\) | 50\.83 \(±1\.5\) | 51\.57 \(±1\.1\) | 96\.92 \(±1\.8\) | 62\.80 \(±1\.3\) |

Table 6: Overall results of *AMATA* with other backbone LLMs\.

### B\.2 Hyperparameter Sensitivity Analysis

#### B\.2\.1 Loss Coefficients

We evaluate the impact of the loss coefficients in the total loss $\mathcal{L}_{\text{total}}$ by experimenting with five different pairs of values for $\alpha_{1}$ and $\alpha_{2}$\. As shown in Figure [14](https://arxiv.org/html/2605.17352#A2.F14), our model achieves optimal performance when the coefficients are balanced \(i\.e\., $\alpha_{1}=0.5$ and $\alpha_{2}=0.5$\)\.

#### B\.2\.2 Number of Winning and Losing Samples

In Figure [15](https://arxiv.org/html/2605.17352#A2.F15), we analyze the effect of varying the number of winning and losing samples, denoted $M$ and $N$, as well as the top\-$K$ value in our dependency\-aware DPO framework\. We observe that increasing $M$ and $N$ causes model performance to gradually decline, likely due to the introduction of excessive retrieval noise\. Consequently, we set both $M$ and $N$ to 10\. With $M=N=10$, we further investigate how different values of $K$ affect performance across winning examples\. Our findings indicate that choosing $K$ as the midpoint of $M$ yields optimal results\. Therefore, we set $K=5$ to achieve the best performance reported in Table [2](https://arxiv.org/html/2605.17352#S3.T2)\.
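
One way to picture the roles of $M$, $N$, and $K$ is the selection sketch below\. It rests on assumptions that this appendix does not confirm: candidate trajectories are ranked by the sum of their per\-agent scores \(as described for inter\-dependency preference data in A\.1\.2\), the top $M$ are treated as winners and the bottom $N$ as losers, and $K$ picks the highest\-ranked subset of the winners\. The function name and data layout are hypothetical\.

```python
from typing import List, Sequence, Tuple

Trajectory = Tuple[str, Sequence[float]]  # (trajectory id, per-agent scores)


def split_winners_losers(
    trajectories: List[Trajectory], m: int = 10, n: int = 10, k: int = 5
) -> Tuple[List[Trajectory], List[Trajectory], List[Trajectory]]:
    """Rank by summed score; return (winners, losers, top-k winners)."""
    if m + n > len(trajectories):
        raise ValueError("need at least m + n trajectories to avoid overlap")
    if k > m:
        raise ValueError("k cannot exceed the number of winners m")
    ranked = sorted(trajectories, key=lambda t: sum(t[1]), reverse=True)
    winners = ranked[:m]
    losers = ranked[len(ranked) - n:]
    top_k = winners[:k]
    return winners, losers, top_k


# Toy candidates: trajectory "t{i}" has two agent scores equal to i.
candidates = [(f"t{i}", [i, i]) for i in range(25)]
winners, losers, top_k = split_winners_losers(candidates)
print([t[0] for t in top_k])
```

With the reported setting $M=N=10$ and $K=5$, the sketch keeps ten winners, ten losers, and the five best\-scored winners\.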

![Refer to caption](https://arxiv.org/html/2605.17352v1/x17.png)

Figure 17: Complete response example from SMART\.

![Refer to caption](https://arxiv.org/html/2605.17352v1/x18.png)

Figure 18: Complete response example from GiGPO\.

### B\.3 LLMs’ Backbone Analysis

Our *AMATA* framework decouples the model architecture from backbone selection\. To validate the effectiveness of our method, we further evaluate several state\-of\-the\-art LLMs, including Llama\-3 Dubey et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib150)\) and GPT\-4o [15](https://arxiv.org/html/2605.17352#bib.bib121), using Llama\-3\-Instruct \(8B\) as the backbone for the trained methods\. As shown in Table [6](https://arxiv.org/html/2605.17352#A2.T6), RL\-based trajectory planning methods demonstrate even greater potential when enhanced LLM capabilities are available\. We speculate that stronger foundational abilities are unlocked by post\-RL training, which intrinsically provides planning skills for multi\-agent tasks Sun et al\. \([2024](https://arxiv.org/html/2605.17352#bib.bib149)\)\. *AMATA* consistently exhibits performance improvements as the underlying backbone is strengthened\.

### B\.4 Case Study

In this section, we present a case study analyzing our*AMATA*framework alongside other trajectory training paradigms, including*MMAgent*,*SMART*, and*GiGPO*\.

The complete response generated by*AMATA*is shown in Figure[12](https://arxiv.org/html/2605.17352#A1.F12)\. The user instruction pertains to retrieved document index “\[1\]”, enabling𝒜KF\\mathcal\{A\}\_\{\\text\{KF\}\}and𝒜KL\\mathcal\{A\}\_\{\\text\{KL\}\}to make accurate inferences based on the document\. Moreover, sufficient external documentation supports the LLM’s inference, resulting in high confidence in the generated answer\. Our answer verifier,𝒜AV\\mathcal\{A\}\_\{\\text\{AV\}\}, confirms that no further validation is required\.

In contrast, other trajectory learning paradigms face various types of errors\. As shown in Figure[16](https://arxiv.org/html/2605.17352#A2.F16), since*MMAgent*trains each agent independently, internal connections within the trajectory are neglected\. This causes the retriever to fail to fully comprehend the semantics of the user’s question, especially when retrieved documents are labeled “None”\. Consequently, the combined operation of knowledge retriever𝒜KR\\mathcal\{A\}\_\{\\text\{KR\}\},𝒜KF\\mathcal\{A\}\_\{\\text\{KF\}\}, and𝒜KL\\mathcal\{A\}\_\{\\text\{KL\}\}does not effectively contribute to the LLM’s inference\.

As shown in Figure [17](https://arxiv.org/html/2605.17352#A2.F17), *SMART* benefits from predefined global\-local trajectory training and produces a generally correct trajectory path\. However, both *SMART* and *GiGPO*, as shown in Figures [17](https://arxiv.org/html/2605.17352#A2.F17) and [18](https://arxiv.org/html/2605.17352#A2.F18), fail to account for dependencies between inter\-agent processes\. This oversight introduces external noise into the retrieved documents, resulting in inaccurate overall trajectories\.

## Appendix C Inference

Algorithm [1](https://arxiv.org/html/2605.17352#alg1) gives an overview of inference in our *AMATA* framework\. During inference, *AMATA* first analyzes the query to determine whether external knowledge is required\. If not, it directly generates and verifies the answer\. Otherwise, it retrieves and filters relevant documents, then generates a grounded response\. The answer is subsequently verified; if found incorrect, the process iterates with updated instructions to refine the response\. The specific steps are as follows:

- Step 1: We use the intent reconstructor $\mathcal{A}_{\text{IR}}$ to decompose the user prompt $\mathcal{Q}$\. If the initial trajectory head token $h = h_{\text{RG}}$ indicates that the question can be answered directly by the LLM without external knowledge, the response generator $\mathcal{A}_{\text{RG}}$ generates the answer $\mathcal{Y}$\. Subsequently, the answer verifier $\mathcal{A}_{\text{AV}}$ checks the consistency of the generated response\. \(Lines 1–13\)
- Step 2: When $h = h_{\text{KR}}$ indicates that the question $\mathcal{Q}$ requires external knowledge, the knowledge retriever $\mathcal{A}_{\text{KR}}$ retrieves external documents $\{d_1, \dots, d_{k \cdot m}\}$\. The knowledge filter $\mathcal{A}_{\text{KF}}$ removes noise from these documents to produce a refined set $D = \{d_1, \dots, d_w\}$\. The knowledge locator $\mathcal{A}_{\text{KL}}$ extracts fine\-grained text spans $y_{\text{KL}}$ from $D$ based on $\mathcal{Q}$\. \(Lines 16–23\)
- Step 3: If the extracted span is relevant \($r = \text{[Relevant]}$\), the response generator $\mathcal{A}_{\text{RG}}$ formulates the answer using this span; otherwise, it answers directly\. The answer verifier evaluates the response $\mathcal{Y}$ for factual correctness\. If it is incorrect, *AMATA* restarts the trajectory generation at “Start”, incorporating the erroneous response into $\mathcal{Q}$ for further reasoning\. \(Lines 24–35\)
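The retrieval and filtering part of Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `retrieve_top_k`, `gather_and_filter`, the word-overlap ranking, and the `keep` predicate are hypothetical stand-ins for the learned agents $\mathcal{A}_{\text{KR}}$ and $\mathcal{A}_{\text{KF}}$, and the de-duplication step is an added convenience that the text does not specify.

```python
from typing import Callable, List, Tuple


def retrieve_top_k(sub_query: str, corpus: List[str], k: int) -> List[str]:
    """Toy stand-in for the retriever: rank passages by word overlap."""
    q_words = set(sub_query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def gather_and_filter(
    question: str,
    sub_queries: List[str],
    corpus: List[str],
    k: int,
    keep: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Retrieve k passages per sub-query (k * m in total), then filter.

    `keep(question, doc)` is a hypothetical relevance predicate standing in
    for the knowledge filter. Returns (candidates, refined).
    """
    candidates: List[str] = []
    for sub_query in sub_queries:
        candidates.extend(retrieve_top_k(sub_query, corpus, k))

    refined: List[str] = []
    seen = set()
    for doc in candidates:
        # Assumption: duplicates across sub-queries are dropped here.
        if doc not in seen and keep(question, doc):
            seen.add(doc)
            refined.append(doc)
    return candidates, refined
```

With $m$ sub-queries and top-$k$ retrieval, `candidates` holds $k \cdot m$ passages as long as the corpus has at least $k$ entries, and `refined` plays the role of the filtered set $D$ of size $w \le k \cdot m$.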

When the verifier determines that the LLM response is incorrect \(i\.e\., Lines 12 and 34\), we incorporate the incorrect answer into the prompt $\mathcal{Q}$ and reuse the previously generated dynamic trajectory for further inference, denoted as “goto Start\.” To prevent repeated incorrect answer generation and infinite loops, a maximum of three iterations is allowed, after which the result is returned as the default answer\.
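The bounded retry behaviour described above can be sketched as a simple loop. This is an illustrative sketch only: `answer_with_retries`, the `generate` and `verify` callables, and the prompt template used to fold the rejected answer back into the prompt are hypothetical, and returning the last generated answer once the cap is reached is one reading of "the result is returned as the default answer".

```python
from typing import Callable, Tuple

MAX_ITERATIONS = 3  # iteration cap stated in the text


def answer_with_retries(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, int]:
    """Generate, verify, and retry with the rejected answer in the prompt.

    Returns (answer, attempts_used). `generate` stands in for the whole
    trajectory from "Start" to the response generator; `verify` stands in
    for the answer verifier.
    """
    prompt = question
    answer = ""
    for attempt in range(1, max_iterations + 1):
        answer = generate(prompt)
        if verify(question, answer):
            return answer, attempt
        # Hypothetical template: expose the rejected answer on the next pass.
        prompt = f"{prompt}\nPreviously rejected answer: {answer}"
    # Cap reached: fall back to the most recent answer (assumption).
    return answer, max_iterations
```

The cap guarantees termination even when the verifier never accepts an answer, at the cost of occasionally returning an unverified response.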

Additionally, we analyze the out\-of\-domain generalization of our algorithm\. Once trained, *AMATA* operates as a zero\-shot, self\-adaptive system that requires no pre\-existing training traces or score annotations for new domains or questions\.

Generalization to Unseen Datasets and Domains\. The core of *AMATA*’s generalization lies in its learning objective: we train it not to memorize specific answers or trajectories but to learn two fundamental principles:

- Intra\-Trajectory Preference: Dynamically assess which agents are important for a given question based on semantic content\.
- Inter\-Agent Dependency: Orchestrate selected agents in a coherent and efficient sequence\.

Once this “collaboration policy” is learned, it can be applied to new questions\. The model analyzes each new, unseen question at inference and uses its acquired knowledge to:

- Dynamically predict the relevance of each agent \(simulating internal scoring\)\.
- Execute an optimal trajectory of agent invocations based on these predictions and learned dependencies\.
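As a rough intuition for these two steps, the sketch below turns per-agent relevance scores into an ordered trajectory. It is only an analogy under explicit assumptions: in *AMATA* both the scoring and the ordering are learned by the model, whereas here the scores are given as input, the threshold is arbitrary, and the `DEPENDS_ON` table and `plan_trajectory` function are hypothetical. Only the six agent abbreviations and their order are taken from Algorithm 1.

```python
from typing import Dict, List

# Agent order as it appears in Algorithm 1.
PIPELINE_ORDER: List[str] = ["IR", "KR", "KF", "KL", "RG", "AV"]

# Hypothetical dependency table: agent -> agents whose output it consumes.
DEPENDS_ON: Dict[str, List[str]] = {"KF": ["KR"], "KL": ["KF"]}


def plan_trajectory(relevance: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Select agents scoring at or above `threshold`, add prerequisites, order them."""
    selected = {agent for agent, score in relevance.items() if score >= threshold}

    # Pull in every prerequisite of a selected agent, transitively.
    stack = list(selected)
    while stack:
        agent = stack.pop()
        for dep in DEPENDS_ON.get(agent, []):
            if dep not in selected:
                selected.add(dep)
                stack.append(dep)

    return [agent for agent in PIPELINE_ORDER if agent in selected]
```

For a question that needs no external knowledge, low scores for the three knowledge agents yield the short path `["IR", "RG", "AV"]`; a high score for the knowledge locator alone pulls in the retriever and filter it depends on.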

Applicability without Training Traces\. Our model incurs a one\-time cost for bootstrapping collaborative reasoning capability\. The trained framework is fully self\-sufficient at inference\.

Algorithm 1 Inference of *AMATA*

Require: Intent Reconstructor $\mathcal{A}_{\text{IR}}$, Knowledge Retriever $\mathcal{A}_{\text{KR}}$, Knowledge Filter $\mathcal{A}_{\text{KF}}$, Knowledge Locator $\mathcal{A}_{\text{KL}}$, Response Generator $\mathcal{A}_{\text{RG}}$, Answer Verifier $\mathcal{A}_{\text{AV}}$, passage collections $d_1, \dots, d_k$, trajectory head token $h$, trajectory end token $e$

Input: User prompt $\mathcal{Q}$

Output: Answer $\mathcal{Y}$

1: Start: $\mathcal{A}_{\text{IR}}$ predicts $q_1, \dots, q_m, e_{\text{IR}}, h$ given $\mathcal{Q}, h_{\text{IR}}$

2: if $h = h_{\text{RG}}$ then

3: $\mathcal{A}_{\text{RG}}$ predicts $\mathcal{Y}$, $e_{\text{RG}}$, $h_{\text{AV}}$ given $\mathcal{Q}$, $e_{\text{IR}}$, $h_{\text{RG}}$

4: \# $\mathcal{A}_{\text{AV}}$ verifies the response\.

5: $\mathcal{A}_{\text{AV}}$ predicts $y_{\text{AV}}$, $e_{\text{AV}}$ given $\mathcal{Q}$, $\mathcal{Y}$, $h_{\text{AV}}$

6: if $y_{\text{AV}} = \text{``Correct''}$ then

7: return $\mathcal{Y}$

8: else

9: \# Re\-executing with wrong answer $\mathcal{Y}$\.

10: goto Start

11: end if

12: else

13: \# $h = h_{\text{KR}}$: answer the question $\mathcal{Q}$\.

14: for each $p$ in $q_1, \dots, q_m$ do

15: Retrieve $(d_1, \dots, d_k)$ using $\mathcal{A}_{\text{KR}}$ given $p$, top\-$k$

16: end for

17: $D = \{d_1, \dots, d_{k \cdot m}\}$

18: \# $\mathcal{A}_{\text{KF}}$ filters out unrelated $d_i$\.

19: Filter $D = \{d_1, \dots, d_w\}$ using $\mathcal{A}_{\text{KF}}$ given $\mathcal{Q}$, $q$

20: \# $\mathcal{A}_{\text{KL}}$ locates the key text span $y_{\text{KL}}^{i}$\.

21: $\mathcal{A}_{\text{KL}}$ predicts $\{(r_1, y_{\text{KL}}^{1}), \dots, (r_w, y_{\text{KL}}^{w})\}$, $e_{\text{KL}}$, $h_{\text{RG}}$ given $\mathcal{Q}$, $\{d_1, \dots, d_w\}$, $e_{\text{KR}}$, $h_{\text{KL}}$

22: if $r = \text{[Relevant]}$ then

23: $\mathcal{A}_{\text{RG}}$ predicts $\mathcal{Y}$, $e_{\text{RG}}$, $h_{\text{AV}}$ given $\mathcal{Q}$, $e_{\text{KL}}$, $h_{\text{RG}}$, $\{(r_1, y_{\text{KL}}^{1}), \dots, (r_w, y_{\text{KL}}^{w})\}$

24: else

25: $\mathcal{A}_{\text{RG}}$ predicts $\mathcal{Y}$, $e_{\text{RG}}$, $h_{\text{AV}}$ given $\mathcal{Q}$, $h_{\text{RG}}$

26: end if

27: \# $\mathcal{A}_{\text{AV}}$ verifies the response\.

28: $\mathcal{A}_{\text{AV}}$ predicts $y_{\text{AV}}$, $e_{\text{AV}}$ given $\mathcal{Q}$, $\mathcal{Y}$, $h_{\text{AV}}$

29: if $y_{\text{AV}} = \text{``Correct''}$ then

30: return $\mathcal{Y}$

31: else

32: goto Start

33: end if

34: end if
