A Sober Look at Agentic Misalignment in Automated Workflows

arXiv cs.AI 05/26/26, 04:00 AM Papers
Summary
This paper studies agentic misalignment in multi-agent systems with automated workflows, proposing Agentic Evidence Attribution (AEA) to correct misaligned agent behavior using context-specific evidence.
arXiv:2605.24197v1 Announce Type: new Abstract: We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.
Cached at: 05/26/26, 09:06 AM
# A Sober Look at Agentic Misalignment in Automated Workflows
Source: [https://arxiv.org/html/2605.24197](https://arxiv.org/html/2605.24197)
Wenqian Ye¹, Bo Yuan², Zhichao Xu³, Yijun Tian³, Yawei Wang³, Henry Kautz¹, Aidong Zhang¹

¹University of Virginia, ²Georgia Institute of Technology, ³Amazon AWS AI

\{wenqian, aidong\}@virginia\.edu

###### Abstract

We study a class of emergent misalignment in multi\-agent systems \(MAS\), with a focus on automated workflows, which we refer to as agentic misalignment\. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals\. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows\. To address this issue, we propose Agentic Evidence Attribution \(AEA\), a novel alignment paradigm that improves agent posteriors using context\-specific evidence\. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration\. To better understand the role of evidence, we study two instantiations of AEA: self\-reflection \(internal evidence from the model\) and weak\-to\-strong generalization \(external evidence on the agentic trajectory\)\. We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution\. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence\-based alignment can effectively improve agent collaboration and lead to reliable multi\-agent systems built on automated workflows\.

## 1 Introduction

Multi\-agent systems \(MAS\) with Large Language Models \(LLM\) are increasingly used for tasks that require decomposition\[[27](https://arxiv.org/html/2605.24197#bib.bib27),[9](https://arxiv.org/html/2605.24197#bib.bib9)\], tool use\[[28](https://arxiv.org/html/2605.24197#bib.bib28),[43](https://arxiv.org/html/2605.24197#bib.bib43)\], and multi\-step reasoning\[[2](https://arxiv.org/html/2605.24197#bib.bib2),[25](https://arxiv.org/html/2605.24197#bib.bib25)\]\. Most recent MAS leverage an orchestrator \(or meta\-agent\) to construct various roles of agents and arrange the order of interactions through a technique called automated workflow generation\[[19](https://arxiv.org/html/2605.24197#bib.bib19),[45](https://arxiv.org/html/2605.24197#bib.bib45)\]\. This technique produces a general solution to build collaboration between agents to autonomously plan and act on individual tasks\. Although these workflows allow agents to handle complex objectives, recent studies have also reported that MAS often fails in several systematic ways\[[6](https://arxiv.org/html/2605.24197#bib.bib6),[12](https://arxiv.org/html/2605.24197#bib.bib12),[16](https://arxiv.org/html/2605.24197#bib.bib16),[21](https://arxiv.org/html/2605.24197#bib.bib21)\]\. These alignment failures arise even when each individual agent performs well in isolation, suggesting an emerging underlying coordination problem within the agentic workflow\.

One qualitative perspective for interpreting these misalignments is that agents act according to a shared utility \(or reward in some literature\) shaped by generic pretraining and post\-training, rather than the role\-specific goals required by the workflow \(Left in Figure[1](https://arxiv.org/html/2605.24197#S1.F1)\)\[[17](https://arxiv.org/html/2605.24197#bib.bib17),[29](https://arxiv.org/html/2605.24197#bib.bib29),[4](https://arxiv.org/html/2605.24197#bib.bib4)\]\. In automated workflows, agents observe only partial information about the generic objective and the behavior of other agents\. As a result, the agent forms an implicit utility that is shaped by a generic training distribution rather than the intended role with respect to the agents themselves\. This implicit utility can diverge from the objective of the whole workflow and lead to reward hacking\[[31](https://arxiv.org/html/2605.24197#bib.bib31)\], including misaligned behaviors such as effort minimization, incomplete hand\-offs, and short\-term decisions that potentially harm other agents’ goals\[[37](https://arxiv.org/html/2605.24197#bib.bib37),[11](https://arxiv.org/html/2605.24197#bib.bib11)\]\. These harmful behaviors make automated workflows difficult to control and limit their reliability in deployment\.

![Refer to caption](https://arxiv.org/html/2605.24197v1/x1.png)

Figure 1: Left: Agents in the workflow suffer from posterior collapse, where the generic posterior remains uncertain and fails to identify the true role $\theta^{*}$, leading to system failure\. Right: AEA injects structured evidence $E_t$ to induce variance contraction\. This sharpens the agent’s belief into an evidential posterior, ensuring that the agent aligns with its specific role\.

To gain a more rigorous understanding of the mechanics of these behaviors, we systematically study this problem by modeling the multi\-agent collaboration process as Bayesian inference over latent utility\. We hypothesize that each individual agent infers a posterior utility based on its own evidence \(from workflow system prompts, intermediate agent states, and observation of other agents\) during workflow execution\. When the evidence is sparse or generic, the agent’s inferred utility will deviate from the role\-specific objective, producing inconsistent behavior across steps\. This perspective connects MAS failures to a form of utility misspecification: the workflow produces an incomplete likelihood model, and generic alignment training produces a dominant prior that does not separate the agents’ roles\. Both effects increase the chance of misalignment in workflows\.

Based on these insights, we propose Agentic Evidence Attribution \(AEA\), a novel alignment paradigm that strengthens the inferred utility of each agent with additional structured evidence extracted from the workflow \(shown in Figure[1](https://arxiv.org/html/2605.24197#S1.F1), right\)\. AEA analyzes turn\-level traces from existing MAS trajectories, identifies which actions support or harm the task, and assigns context\-aware and role\-specific feedback\. This evidence reduces the ambiguity in the agent’s utility posterior and improves coordination\. Because AEA operates on workflow traces, it serves as a flexible, model\-agnostic framework that can align proprietary multi\-agent systems without requiring access to their internal representations\.

We evaluate a diverse set of automated multi\-agent workflows on a wide range of challenging human\-level tasks, including programming, data analysis, scientific reasoning, and competition\-level mathematics\. We study two instantiations of the proposed framework: Self\-Reflection, where the MAS uses the base model to extract evidence from its own traces, and Weak\-to\-Strong generalization, where a separately trained evidence model is used for agentic failure attribution and alignment\. Across various models and benchmarks, AEA consistently reduces coordination failures and improves reliability relative to vanilla multi\-agent baselines\. Notably, the gains are not explained by additional computation or naive test\-time scaling, but by correcting role\-level decision errors that would otherwise propagate through the workflow\. These results support the view that many multi\-agent failures stem from utility misspecification rather than lack of capability, and that evidence\-conditioned alignment is an effective mechanism for improving MAS automated workflows\.

## 2 Related Works

Agentic Misalignment\. Misaligned behaviors in reinforcement learning arise when an agent exploits an imperfect reward function to maximize returns without achieving the human’s intended goal\[[35](https://arxiv.org/html/2605.24197#bib.bib35)\]\. This reward hacking phenomenon deeply aligns with the well\-known Goodhart’s Law\[[15](https://arxiv.org/html/2605.24197#bib.bib15)\] and has been widely observed in LLM alignment, where preference datasets encode spurious cues such as verbosity or sycophancy that models can exploit\[[41](https://arxiv.org/html/2605.24197#bib.bib41)\]\. Recent work highlights that agentic systems also inherit these risks: goal misspecification and goal misgeneralization can cause agents to deviate from human intent under distribution shifts, producing systematic failures as capabilities grow\[[4](https://arxiv.org/html/2605.24197#bib.bib4),[33](https://arxiv.org/html/2605.24197#bib.bib33)\]\. These observations motivate a more careful study of reward\-driven behavior in multi\-agent settings\.

Automated Workflows\. Modern agentic systems rely on automated workflow generation to assign roles and determine the sequence of agent actions\. These workflows are produced through prompting or high\-level planners \(i\.e\., orchestrators/meta\-agents\), which implicitly induce a posterior over each agent’s utility\[[23](https://arxiv.org/html/2605.24197#bib.bib23)\]\. However, this posterior can be unstable: small prompt changes or planning errors can cause agents to minimize effort, pass insufficient information, or misinterpret their responsibilities\[[6](https://arxiv.org/html/2605.24197#bib.bib6)\]\. Because the underlying reward is generic rather than role\-specific, agents optimize for signals that do not match the workflow’s true intent, leading to reward hacking across the steps of the workflow\. This gap between automatically generated roles and the generic utility that agents optimize remains a central barrier to trustworthy and reliable multi\-agent systems\. In this paper, we focus mainly on the centralized MAS discussed in Kim et al\. \[[16](https://arxiv.org/html/2605.24197#bib.bib16)\], where an orchestrator generates the subagents, since it represents the mainstream of currently deployed MAS\.

Reward Modeling for Agents\. Reward models \(RMs\) play a critical role in aligning LLMs with human preferences\. Recent work frames alignment as Bayesian inference, where an RM provides evidence that shapes the learned posterior\[[17](https://arxiv.org/html/2605.24197#bib.bib17),[40](https://arxiv.org/html/2605.24197#bib.bib40)\]\. Existing RMs, whether scalar or generative, assume a single shared objective and do not account for the role\-specific structure of multi\-agent workflows, which limits their ability to prevent coordination failures\. Our work addresses this gap by producing context\-aware and agent\-specific signals\. One related work, Variational Preference Learning \(VPL\)\[[26](https://arxiv.org/html/2605.24197#bib.bib26)\], infers latent user context to model personalized preferences\. However, VPL requires architecture\-level integration \(training specific encoders/decoders\) to learn the latent distributions of LLMs, whereas AEA is workflow\-agnostic and operates purely in the textual context/prompting space accessible to LLMs\. Crucially, our theoretical framework fundamentally reinterprets this dynamic: rather than treating the reward signal as the optimization objective itself, we model it as external observational evidence required to contract the variance of the agent’s internal latent utility, thereby resolving the posterior collapse inherent to pre\-trained priors\.

## 3 A Sober Look at Agentic Misalignment

### 3\.1 Problem Formulation

We formalize the automated workflow as a partially observed multi\-agent Markov Decision Process \(POMDP\)\[[34](https://arxiv.org/html/2605.24197#bib.bib34),[46](https://arxiv.org/html/2605.24197#bib.bib46)\] augmented with latent role variables\. The system is defined by the tuple $\mathcal{M}=\langle\mathcal{N},\mathcal{S},\mathcal{A},\mathcal{T},\Theta,\mathcal{U}\rangle$\. Here, $\mathcal{N}=\{1,\dots,N\}$ represents the set of agent roles, while $\mathcal{S}$ denotes the global state space that consists of the task context, dialogue history, and tool outputs\. The joint action space is given by $\mathcal{A}=\times_{i\in\mathcal{N}}\mathcal{A}_{i}$, where $\mathcal{A}_{i}$ is the specific action space for agent $i$\. $\mathcal{T}:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ denotes the state transition function, with $\Delta(\mathcal{S})$ as the probability simplex over states\. To define role\-specific behaviors, we introduce $\Theta=\{\theta_{1},\dots,\theta_{K}\}$ as a finite set of latent utility types \(e\.g\., “strict verifier”, “accurate approximator”\)\. Lastly, $\mathcal{U}:\mathcal{S}\times\mathcal{A}\times\Theta\to\mathbb{R}$ represents the utility function conditioned on the roles\.

At any time step $t$, a single agent $i=\psi(t)$ is active according to the workflow orchestration\. The agent does not observe the true role intention directly\. Instead, the ground\-truth role assignment is a random variable $\boldsymbol{\theta}_{i}\in\Theta$, following a prior distribution $P(\boldsymbol{\theta})$\. The oracle \(ideal\) policy maximizes the true utility associated with this latent type:

$$\pi^{*}_{i}(s_{t};\theta)\in\arg\max_{a\in\mathcal{A}_{i}}\;\mathcal{U}(s_{t},a;\theta). \tag{1}$$

###### Definition 3\.1 \(Agentic Misalignment via Decisive Errors\)\.

Let $\tau$ be a trajectory and $Z(\tau)\in\{0,1\}$ be a binary failure indicator \($0$ means no failure\)\. A time step $t$ is a Decisive Error if $Z(\tau)=1$, but there exists an action $\tilde{a}_{t}\in\mathcal{A}_{i}$ such that the modified trajectory $\tilde{\tau}$ satisfies $Z(\tilde{\tau})=0$\. Agentic Misalignment occurs when the agent’s policy $\hat{\pi}_{i}$ selects $a_{t}$ \(the error\) instead of $\tilde{a}_{t}$ because $a_{t}$ maximizes the expected utility under the generic posterior, distinct from the specific role type $\boldsymbol{\theta}_{i}$:

$$a_{t}\in\arg\max_{a}\mathbb{E}_{\theta\sim P(\cdot\mid Y)}[\mathcal{U}(s_{t},a;\theta)], \tag{2}$$

where $a_{t}\neq\pi^{*}_{i}(s_{t};\boldsymbol{\theta}_{i})$ and $Y$ is the evidence of the agent\.
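A toy illustration of Definition 3.1 may help. The sketch below is not from the paper: the two latent types reuse the example names from Section 3.1, while the action names, utility values, and posterior weights are hypothetical numbers chosen only to show how the expected-utility action under a generic posterior can differ from the oracle action in Eq. (1).

```python
# Hypothetical utility table U(s, a; theta) for one fixed state s.
UTILITY = {
    "strict_verifier": {"run_all_tests": 1.0, "skim_and_approve": 0.0},
    "accurate_approximator": {"run_all_tests": 0.2, "skim_and_approve": 0.9},
}
ACTIONS = ["run_all_tests", "skim_and_approve"]


def oracle_action(true_type):
    """Eq. (1): the action that maximizes utility under the true latent type."""
    return max(ACTIONS, key=lambda a: UTILITY[true_type][a])


def posterior_action(posterior):
    """Eq. (2): the action that maximizes expected utility under a posterior."""
    def expected_utility(action):
        return sum(p * UTILITY[theta][action] for theta, p in posterior.items())
    return max(ACTIONS, key=expected_utility)


true_type = "strict_verifier"

# A generic posterior that barely separates the two types (assumed values).
generic = {"strict_verifier": 0.4, "accurate_approximator": 0.6}
# A sharper posterior after role-specific evidence (assumed values).
evidential = {"strict_verifier": 0.9, "accurate_approximator": 0.1}

print(oracle_action(true_type))      # action required by the true role
print(posterior_action(generic))     # action chosen under the generic posterior
print(posterior_action(evidential))  # action chosen under the sharper posterior
```

In this toy setting the generic posterior selects an action that differs from the oracle action, which is the situation Definition 3.1 labels a decisive error when it causes the trajectory to fail.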

### 3\.2 Misalignment as a Role Inference Problem

The core obstacle is that agent $i$ should infer the specific instantiation $\boldsymbol{\theta}_{i}$ from the available evidence $Y_{t}$ at time $t$\. We model this as Bayesian inference where the agent maintains a belief state $b_{t}(\theta)=P(\boldsymbol{\theta}_{i}=\theta\mid Y_{t})$\. The failure of multi\-agent systems stems from posterior collapse, where the generic pre\-training prior dominates the weak signal provided by the workflow prompt\.

We quantify the divergence between two distributions using the Total Variation \(TV\) distance, $\|P-Q\|_{\mathrm{TV}}=\frac{1}{2}\sum|P(x)-Q(x)|$\. To rigorously bound the posterior update, we introduce a stability condition on the marginal likelihood \(evidence\)\.

###### Theorem 3\.2 \($\epsilon$\-Stability of Role Posteriors\)\.

Let $i,j\in\mathcal{N}$ be two distinct roles\. Assume the priors are $\epsilon_{\pi}$\-close: $\|P(\cdot\mid i)-P(\cdot\mid j)\|_{\mathrm{TV}}\leq\epsilon_{\pi}$, and the likelihoods are $\epsilon_{\ell}$\-close\. Furthermore, assume the workflow evidence $Y$ is sufficiently informative such that the marginal likelihood is lower\-bounded by $\zeta>0$ for both roles\. Then, the posterior distributions are $\delta$\-close, where $\delta=\kappa(2\epsilon_{\pi}+\epsilon_{\ell})$ and $\kappa\propto\zeta^{-1}$\. Consequently, the probability that agents $i$ and $j$ select distinct actions is upper\-bounded:

$$P(\hat{\pi}_{i}(Y)\neq\hat{\pi}_{j}(Y))\leq\delta. \tag{3}$$

If the oracle requirements differ \($\pi^{*}_{i}\neq\pi^{*}_{j}$\), the probability of misalignment is lower\-bounded by $1-\delta$\.

This result implies that, without distinct external evidence, agents inevitably collapse toward a mean generic behavior\. To break this symmetry, we should introduce a principled mechanism that explicitly conditions the posterior on role\-specific artifacts\. This motivates our proposed framework\.
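The qualitative content of Theorem 3.2 can be checked numerically on a small discrete example. The following is a minimal sketch, not the paper's experiment: the three latent types, the priors, and the likelihood values are all made-up numbers. It computes Bayes posteriors for two roles and compares their total variation distance under near-identical evidence and under sharply role-specific evidence.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def posterior(prior, likelihood):
    """Bayes rule over a finite set of latent types."""
    joint = [pr * lk for pr, lk in zip(prior, likelihood)]
    marginal = sum(joint)  # marginal likelihood of the evidence
    return [j / marginal for j in joint]


# Hypothetical priors over three latent utility types for roles i and j.
prior_i = [0.50, 0.30, 0.20]
prior_j = [0.48, 0.31, 0.21]

# Generic workflow prompt: likelihoods P(Y | theta) barely differ by role.
lik_i_generic = [0.40, 0.35, 0.25]
lik_j_generic = [0.39, 0.36, 0.25]

# Role-specific evidence: likelihoods point at different types.
lik_i_sharp = [0.90, 0.05, 0.05]
lik_j_sharp = [0.05, 0.05, 0.90]

tv_prior = tv_distance(prior_i, prior_j)
tv_generic = tv_distance(posterior(prior_i, lik_i_generic),
                         posterior(prior_j, lik_j_generic))
tv_sharp = tv_distance(posterior(prior_i, lik_i_sharp),
                       posterior(prior_j, lik_j_sharp))

print(f"TV between priors:                  {tv_prior:.3f}")
print(f"TV between posteriors (generic):    {tv_generic:.3f}")
print(f"TV between posteriors (role-aware): {tv_sharp:.3f}")
```

With close priors and close likelihoods the two posteriors stay close, so both roles would tend to act alike. Only the role-specific likelihoods separate them.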

## 4 Agentic Evidence Attribution

In this section, we present Agentic Evidence Attribution \(AEA\), an analytical and practical framework for correcting agentic misalignment\. We first establish an analytical framework in Section[4\.1](https://arxiv.org/html/2605.24197#S4.SS1), defining the limits of alignment via information theory\. Section[4\.2](https://arxiv.org/html/2605.24197#S4.SS2) details the mechanism of posterior contraction, proving how evidence reduces variance in the agent’s latent belief in its role\. Finally, Section[4\.3](https://arxiv.org/html/2605.24197#S4.SS3) proposes scalable instantiations of AEA using learned evidence models to satisfy these theoretical conditions in practice\.

### 4\.1 An Analytical Framework

We formalize Agentic Evidence Attribution \(AEA\) as the extraction of latent evidence from the system’s own history and its injection into the agent’s context\. AEA leverages the information asymmetry between the runtime evidence $Y_{t}$ from partial observation and the complete computational trace $\tau_{<t}$\.

Let $E_{t}=\mathcal{F}(\tau_{<t})$ be additional evidence derived from the trajectory history using a function $\mathcal{F}$\. The agent’s evidence transforms from $Y_{t}$ to $Y^{\prime}_{t}=(Y_{t},E_{t})$\. We analyze the theoretical limits of this intervention using information theory\. Let $L^{*}=\pi^{*}_{i}(s_{t};\boldsymbol{\theta}_{i})$ be the discrete label of the optimal action\.

###### Theorem 4\.1 \(Lower Bound on Misalignment Error\)\.

Assume the action space has cardinality $|\mathcal{A}_{i}|=M$\. Let $\hat{a}(Y)$ be any decision rule conditioned on evidence $Y$\. The probability of decisive error is lower\-bounded by Fano’s Inequality:

$$P_{e}\geq 1-\frac{I(L^{*};Y)+\log 2}{\log M}, \tag{4}$$

where $I(L^{*};Y)$ is the mutual information between the true optimal action label and the evidence\.

This lower bound highlights a fundamental limit: no alignment algorithm can succeed without sufficient mutual information between the evidence and the optimal action\. This motivates the design of AEA to explicitly maximize this information channel\.
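To get a feel for the scale of the bound in Eq. (4), it can be evaluated directly. The helper below is a minimal sketch under two assumptions that are not stated in the text: mutual information is measured in nats so that it matches the natural logarithms, and negative values of the bound are clipped to zero because they carry no information. The action-space size of 16 is an arbitrary illustrative choice.

```python
import math


def fano_lower_bound(mutual_info_nats, num_actions):
    """Evaluate the right-hand side of Eq. (4), clipped at zero.

    mutual_info_nats: I(L*; Y) in nats (assumed unit).
    num_actions: cardinality M of the agent's action space, M > 1.
    """
    bound = 1.0 - (mutual_info_nats + math.log(2)) / math.log(num_actions)
    return max(0.0, bound)


M = 16  # hypothetical number of candidate actions
for bits in (0, 1, 2, 3, 4):
    info = bits * math.log(2)  # convert bits to nats
    print(f"I = {bits} bit(s): P_e >= {fano_lower_bound(info, M):.2f}")
```

When the evidence carries no information about the optimal action, the bound stays high no matter how capable the decision rule is. It only relaxes as the evidence becomes more informative.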

###### Corollary 4\.2 \(Information Gain Condition\)\.

Let $P_{e}(Y)$ and $P_{e}(Y^{\prime})$ be the probabilities of decisive error under baseline and AEA evidence, respectively\. A necessary condition for AEA to strictly reduce the probability of misalignment \(i\.e\., $P_{e}(Y^{\prime})<P_{e}(Y)$\) is

$$I(L^{*};E_{t}\mid Y_{t})>0. \tag{5}$$

Since the base model often suffers from $I(L^{*};Y)\approx 0$ due to posterior collapse, this condition necessitates introducing an external parameterization $\phi$ such that the mutual information $I(L^{*};Y^{\prime})>0$\. Consequently, the success of alignment depends on the quality of the extraction function $\mathcal{F}$\. If $\mathcal{F}$ yields uninformative evidence, misalignment persists regardless of the base model’s capabilities, necessitating a learned approach for evidence extraction\.

This condition clarifies that AEA is not a manual heuristic\. It represents a valid information channel only if the extraction function $\mathcal{F}$ captures correlations between past traces and current optimal actions that are invisible in the baseline prompt\.

### 4\.2 Mechanism via Posterior Contraction

We now instantiate the utility function $\mathcal{U}$ as a linear model to demonstrate the mechanism of AEA concretely\. Let $\phi(s,a)\in\mathbb{R}^{d}$ be a feature vector of the state\-action pair\. Let the latent type $\boldsymbol{\theta}$ correspond to a weight vector $\mathbf{w}\in\mathbb{R}^{d}$, such that $\mathcal{U}(s,a;\mathbf{w})=\mathbf{w}^{\top}\phi(s,a)$\.

We model the agent’s prior belief as a Gaussian: $\mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu}_{0},\Sigma_{0})$\. The baseline evidence $Y$ updates this to a posterior $\mathcal{N}(\boldsymbol{\mu}_{Y},\Sigma_{Y})$\. We model the AEA evidence $E_{t}$ as a noisy linear observation of the true utility weights \(e\.g\., passing a unit test confirms alignment with specific utility dimensions\):

$$E_{t}=\mathbf{H}\mathbf{w}+\boldsymbol{\eta},\quad\boldsymbol{\eta}\sim\mathcal{N}(0,\mathbf{R}), \tag{6}$$

where $\mathbf{H}$ projects the latent utility onto observable verification constraints and $\mathbf{R}$ is the covariance of the observation noise\.

###### Theorem 4\.3 \(Covariance Contraction under Verifiable Evidence\)\.

Under the linear\-Gaussian assumption, the posterior covariance $\Sigma_{Y^{\prime}}$ after observing AEA evidence $E_{t}$ satisfies the Loewner partial order $\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$\. Specifically, the precision matrix increases by the Fisher information of the evidence:

$$\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}. \tag{7}$$

As visualized in Figure[1](https://arxiv.org/html/2605.24197#S1.F1), this covariance contraction narrows the posterior distribution \(green curve\), significantly reducing the probability mass assigned to misaligned actions compared to the generic posterior\.

Mathematically, this confirms that verifiable evidence reduces the variance in the agent’s latent belief space\. By injecting $E_{t}$, we sharpen the distribution around the true role parameter, reducing the likelihood of drawing a misaligned utility vector\.
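The precision update in Eq. (7) is easy to verify on a two-dimensional example. The sketch below uses plain Python and hypothetical values for the baseline covariance, the observation row $\mathbf{H}$, and the noise variance $\mathbf{R}$ (here a scalar, since a single constraint is observed). It applies Eq. (7) and checks that the difference between the old and new covariance is positive semidefinite, which is what the Loewner order requires.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def add2(x, y):
    """Elementwise sum of two 2x2 matrices."""
    return [[x[r][c] + y[r][c] for c in range(2)] for r in range(2)]


def sub2(x, y):
    """Elementwise difference of two 2x2 matrices."""
    return [[x[r][c] - y[r][c] for c in range(2)] for r in range(2)]


# Hypothetical baseline posterior covariance Sigma_Y over two utility weights.
sigma_y = [[1.0, 0.3], [0.3, 2.0]]
# Hypothetical evidence: one noisy observation of the first weight only.
h = [1.0, 0.0]   # single row of H
r = 0.5          # scalar noise variance R

# Fisher information H^T R^{-1} H for a single-row H.
fisher = [[h[i] * h[j] / r for j in range(2)] for i in range(2)]

# Eq. (7): new precision = old precision + Fisher information.
sigma_new = inv2(add2(inv2(sigma_y), fisher))

# Loewner check: Sigma_Y - Sigma_Y' should be positive semidefinite.
diff = sub2(sigma_y, sigma_new)
trace = diff[0][0] + diff[1][1]
det = diff[0][0] * diff[1][1] - diff[0][1] * diff[1][0]

print("Sigma_Y' =", [[round(v, 4) for v in row] for row in sigma_new])
print("difference is PSD:", trace >= 0 and det >= -1e-9)
```

The variance of the observed weight shrinks the most, and the correlated second weight shrinks slightly as well, even though it was never observed directly.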

###### Corollary 4\.4 \(Reliability Improvement via Variance Reduction\)\.

Consider two actions $a_{1},a_{2}$ with feature difference $\delta=\phi(s,a_{1})-\phi(s,a_{2})$ under state $s$\. If $a_{1}$ is optimal \($\mathbf{w}^{\top}\delta>0$\) and the evidence $E_{t}$ is unbiased \(does not decrease the margin mean $\mu$\), then $P_{e}(Y^{\prime})\leq P_{e}(Y)$\.

This corollary provides the probabilistic guarantee for AEA\. By reducing the posterior variance, we strictly increase the probability of selecting the correct role\-aligned action\. The following sections detail how we implement this via learned evidence models\. The proofs of all the theorems and corollaries are given in the Appendix Section[C](https://arxiv.org/html/2605.24197#A3)\.
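One way to read Corollary 4.4 numerically is to treat the margin $\mathbf{w}^{\top}\delta$ as a Gaussian random variable under the agent's posterior and to define the error as the event that this margin is negative. That reading is an assumption made here for illustration; the paper's own proof is in its appendix. Under it, the error probability is a Gaussian tail, and the sketch below evaluates it for hypothetical margin means and variances.

```python
import math


def error_probability(margin_mean, margin_var):
    """P(margin < 0) for a Gaussian margin with the given mean and variance.

    Assumed model: margin = w^T delta with w drawn from the posterior, so
    margin_var plays the role of delta^T Sigma delta.
    """
    z = -margin_mean / math.sqrt(margin_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical numbers: the same positive margin mean, before and after
# the posterior variance is contracted by additional evidence.
mean = 0.5
pe_before = error_probability(mean, margin_var=2.0)
pe_after = error_probability(mean, margin_var=0.5)

print(f"P_e before contraction: {pe_before:.3f}")
print(f"P_e after contraction:  {pe_after:.3f}")
```

With a positive margin mean held fixed, shrinking the variance moves probability mass away from the negative side, so the error probability cannot go up.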

### 4\.3 Scalable Instantiation via Learned Evidence

The theoretical guarantees of AEA are based on the existence of an extraction function $\mathcal{F}$ that satisfies the Information Gain Condition \(Corollary[4\.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2)\)\. Crucially, this evidence $E_{t}$ is not defined by manual rules\. Instead, we define AEA as a learned parameterized function $\mathcal{F}_{\phi}(\tau_{<t})$\. We propose two instantiations to implement this function, motivated by the limitations of the agent’s prior belief\.

#### I\. Self\-Reflection \(Bootstrapping\)\.

The simplest instantiation approximates $\mathcal{F}$ using the base model itself: $E_{t}\approx\mathcal{M}_{\text{base}}(\tau_{<t})$\. However, this approach is theoretically limited by the Dominant Prior assumption \(Theorem[3\.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2)\)\. Since the pre\-trained prior $P(\theta)$ is shared across roles and favors generic behaviors, querying the same model often resamples from the same uninformative posterior\. If the base model’s prior dictates that $P(\theta\mid i)\approx P(\theta\mid j)$, self\-reflection frequently fails to break this symmetry, resulting in "hallucinated compliance" where the agent rationalizes its generic behavior rather than correcting it\.

#### II\. Weak\-to\-Strong Generalization\.

To resolve the negative impact of the dominant prior, we introduce a distinct evidence model $\mathcal{M}_{\omega}$ \(e\.g\., a smaller open\-source model\) to parameterize $\mathcal{F}$ with $\omega$\. Unlike the base model, $\mathcal{M}_{\omega}$ is not intended to solve the task but is trained via reinforcement learning specifically to maximize the mutual information $I(L^{*};E_{t}\mid Y_{t})$\. By optimizing $\omega$ on a distribution of successful and failed traces, this model learns to identify the specific latent variables $\theta$ that distinguish roles\. These features are often ignored by the generic prior of the large model\. This ensures that the generated evidence $E_{t}$ explicitly targets the latent failure modes, transforming alignment from a manual engineering task into a scalable learning problem that strictly satisfies the Information Gain Condition\.

Table 1: Main agentic evaluation on six benchmarks\. For each base model, we compare Single Agent, Multi Agent \(no alignment\), Self\-Reflection \(AEA\), and Weak\-to\-Strong \(AEA\)\. The best performance among the two alignment methods for each benchmark is highlighted in boldface\.

## 5 Experiments

### 5\.1 Agentic Trace Generation

To learn the generalizable agentic trace distribution of the MAS, we construct agentic trace reasoning data from four challenging agent benchmarks shown in the Appendix, Table[10](https://arxiv.org/html/2605.24197#A6.T10): GAIA\[[24](https://arxiv.org/html/2605.24197#bib.bib24)\], AssistantBench\[[42](https://arxiv.org/html/2605.24197#bib.bib42)\], LiveBench\[[38](https://arxiv.org/html/2605.24197#bib.bib38)\], and Who&When\[[46](https://arxiv.org/html/2605.24197#bib.bib46)\]\. The first three benchmarks are originally designed for single\-agent evaluation and focus on various challenging tasks such as coding, reasoning, tool use, and web interaction\. In contrast, Who&When is a native multi\-agent benchmark that targets failure attribution in LLM\-based multi\-agent systems\. However, the current scale of these benchmarks is limited and insufficient on its own to learn generalizable agentic evidence signals and to reflect the real distributional behavior of agents\. We therefore leverage all four benchmarks to construct a unified agentic reasoning dataset by running them under automated workflows and collecting annotated execution traces\.

For each task, we generate the trajectories in JSON format\. We first use Claude\-4 Opus\[[3](https://arxiv.org/html/2605.24197#bib.bib3)\] to produce initial pseudo\-annotations that identify the failing agent, the decisive step, and a candidate repair action\. Similar to Zhang et al\. \[[46](https://arxiv.org/html/2605.24197#bib.bib46)\], these annotations are then reviewed by a team of 5 human experts, who verify the correctness of the failure attribution, adjust the identified steps when necessary, and rewrite prompt optimization suggestions to ensure they are executable and role\-consistent\. The final dataset contains human\-verified agentic traces with structured evidence labels, which are used to train the AEA evidence model\. Details of the MAS implementation are provided in Section[5\.3](https://arxiv.org/html/2605.24197#S5.SS3)\.

### 5\.2 Evaluation

To evaluate the performance of both single\-agent and multi\-agent settings, we use six human\-level challenging benchmarks that span a wide range of difficult tasks\. HumanEval\[[7](https://arxiv.org/html/2605.24197#bib.bib7)\] tests the generation of functional code with human verification\. DataBench\[[10](https://arxiv.org/html/2605.24197#bib.bib10)\] focuses on tool use and decision making in realistic tabular data analysis\. In DataBench, we only provide the link for the tabular data, which significantly increases the difficulty\. SciBench\[[36](https://arxiv.org/html/2605.24197#bib.bib36)\] provides two scientific reasoning tracks in Chemistry and Physics that require accurate, grounded problem solving\. AIME24\[[13](https://arxiv.org/html/2605.24197#bib.bib13)\] and AIME25\[[14](https://arxiv.org/html/2605.24197#bib.bib14)\] measure mathematical reasoning with competition\-level difficulty\. For performance evaluation, we use LLM\-as\-a\-Judge with GPT\-4o\[[1](https://arxiv.org/html/2605.24197#bib.bib1)\] to compare ground truth answers and multi\-agent results\. The evaluation prompts are provided in the Appendix[G](https://arxiv.org/html/2605.24197#A7)\.

### 5\.3 Implementation Details

Multi\-Agent System\. We implement MAS using CaptainAgent from the AG2 library\[[39](https://arxiv.org/html/2605.24197#bib.bib39)\] for automated workflow generation\. CaptainAgent constructs workflows by assigning roles to LLM agents and coordinating their turn\-by\-turn interactions through a shared memory and tool interface\. Each agent receives a role description, observes the partial states of the system, and produces an action that updates the workflow\. This design makes CaptainAgent a representative automated workflow system with the support of modular agent roles, sequential execution, and tool\-based actions\. This setup provides a natural testbed for studying misalignment in multi\-agent workflows\.

Weak\-to\-Strong Training\. We train AEA\-4B by optimizing a reasoning model \(using Qwen3\-4B as the backbone\) on multi\-agent workflow traces via a two\-stage procedure\. Each training example contains the full workflow context \(roles, turn\-level dialog, tool calls, and results\) and a target evidence output that specifies the failing agent, the decisive step, and a concrete correction\. In the first stage, we apply a supervised warm start on the RM\-R1\[[8](https://arxiv.org/html/2605.24197#bib.bib8)\] dataset to align the model with standard preference representations\. In the second stage, we apply GRPO\-based reinforcement learning with verifiable rewards\[[30](https://arxiv.org/html/2605.24197#bib.bib30)\]\. The reward function is decomposed into four weighted components: \(1\) agent identification \(40%\), checking for exact name matches; \(2\) rating alignment \(30%\), penalizing deviations from ground\-truth failure severity; \(3\) correction validity \(20%\), verifying that proposed fixes are non\-trivial; and \(4\) reasoning completeness \(10%\), encouraging comprehensive failure attribution\. Invalid JSON formats receive a fixed penalty of $-0.8$, with the total reward clamped to $[-1,1]$\. We set the learning rate to $1\times 10^{-6}$ and the number of group rollouts to $G=7$\. We maximize the GRPO objective:

𝒥\(ω\)=1G∑i=1G\[min\(ρi\(ω\)Ai,clip\(ρi\(ω\),1−ε,1\+ε\)Ai\)−β𝔻KL\(ℳω\|\|ℳref\)\]\\displaystyle\\mathcal\{J\}\(\\omega\)=\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\[\\min\(\\rho\_\{i\}\(\\omega\)A\_\{i\},\\text\{clip\}\(\\rho\_\{i\}\(\\omega\),1\-\\varepsilon,1\+\\varepsilon\)A\_\{i\}\)\-\\beta\\mathbb\{D\}\_\{\\text\{KL\}\}\(\\mathcal\{M\}\_\{\\omega\}\|\|\\mathcal\{M\}\_\{\\text\{ref\}\}\)\]\(8\)whereℳω\\mathcal\{M\}\_\{\\omega\}denotes the evidence model parameterized byω\\omega,ρi\(ω\)=ℳω\(yi\|x\)ℳold\(yi\|x\)\\rho\_\{i\}\(\\omega\)=\\frac\{\\mathcal\{M\}\_\{\\omega\}\(y\_\{i\}\|x\)\}\{\\mathcal\{M\}\_\{\\text\{old\}\}\(y\_\{i\}\|x\)\}is the probability ratio with the old model policy, andAiA\_\{i\}is the group\-relative advantage\. We use a clipping range ofε=0\.2\\varepsilon=0\.2and a KL penaltyβ=0\.001\\beta=0\.001, relative to reference modelℳref\\mathcal\{M\}\_\{\\text\{ref\}\}\. The maximum context length is set to 12,288 for prompts and 6,144 for responses\. We train 5 epochs with fixed prompts and a shared output schema to ensure the evidence model generalizes uniformly across different workflows\.
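The four-part reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights, the $-0.8$ penalty, and the $[-1,1]$ clamp come from the text, while the JSON field names (`agent`, `step`, `rating`, `correction`), the linear rating penalty, the five-word threshold for a "non-trivial" correction, and the field-coverage proxy for reasoning completeness are all assumptions made for the example.

```python
import json

# Component weights and penalty as stated in the text.
WEIGHTS = {"agent": 0.4, "rating": 0.3, "correction": 0.2, "reasoning": 0.1}
INVALID_JSON_PENALTY = -0.8


def evidence_reward(raw_output: str, target: dict, max_rating_gap: float = 4.0) -> float:
    """Score one evidence-model rollout against a ground-truth annotation."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return INVALID_JSON_PENALTY
    if not isinstance(pred, dict):
        return INVALID_JSON_PENALTY

    # (1) Agent identification: exact name match.
    agent = 1.0 if pred.get("agent") == target["agent"] else 0.0

    # (2) Rating alignment: ASSUMED linear penalty on the gap to the true severity.
    try:
        gap = abs(float(pred.get("rating")) - float(target["rating"]))
    except (TypeError, ValueError):
        gap = max_rating_gap
    rating = 1.0 - min(gap, max_rating_gap) / max_rating_gap

    # (3) Correction validity: ASSUMED proxy, a non-trivial fix has at least 5 words.
    correction = 1.0 if len(str(pred.get("correction", "")).split()) >= 5 else 0.0

    # (4) Reasoning completeness: ASSUMED proxy, fraction of expected fields present.
    fields = ("agent", "step", "rating", "correction")
    reasoning = sum(f in pred for f in fields) / len(fields)

    total = (WEIGHTS["agent"] * agent + WEIGHTS["rating"] * rating
             + WEIGHTS["correction"] * correction + WEIGHTS["reasoning"] * reasoning)
    # Clamp to [-1, 1] as in the text (a no-op for these component ranges).
    return max(-1.0, min(1.0, total))
```

Under these assumptions, a rollout that names the right agent, matches the severity rating, and proposes a multi-word fix scores close to 1, while unparseable output receives the fixed penalty regardless of content.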

Table 2: Failure Attribution of LLM Multi-Agent Systems on Who&When Dataset \[[46](https://arxiv.org/html/2605.24197#bib.bib46)\]. Left: All at Once. Right: Step by Step. Note that we only use the All at Once setting in our proposed AEA alignment method.

### 5.4 Main Results

Inefficient agentic scaling in automated workflows. Table [1](https://arxiv.org/html/2605.24197#S4.T1) demonstrates that transitioning from a Single Agent to a Multi-Agent workflow generally improves performance, particularly for smaller base models. For instance, Claude 3 Haiku sees a substantial jump on HumanEval (61.6% to 79.8%) and Physics (23.5% to 36.7%), suggesting that decomposition helps mitigate reasoning limitations. However, this “Agentic Test-Time Scaling” comes at a cost. As illustrated in Figure [2](https://arxiv.org/html/2605.24197#S5.F2), the multi-agent system incurs a massive computational overhead, with response times increasing by a factor of 12 to 13. More critically, this scaling exhibits fragility on complex tasks such as Chemistry, where unaligned workflows often yield diminishing returns. We hypothesize that without explicit alignment, increased interactions amplify the probability of “decisive errors,” where a single misaligned agent propagates incorrect constraints that poison the collaborative context.

![Refer to caption](https://arxiv.org/html/2605.24197v1/x2.png) Figure 2: Average response time across agent configurations. Multi-agent systems incur 13$\times$ overhead compared to single-agent baselines.

AEA effectively enforces role adherence. Our proposed method, AEA, consistently improves performance across six benchmarks, effectively reversing the degradation seen in original workflows. The gains are most pronounced in environments that require strict adherence to protocols, such as DataBench and AIME. This empirically supports our theoretical claim that failures are driven by posterior collapse. In the absence of AEA, agents default to generic behaviors rather than navigating role-specific constraints. By injecting trace-based evidence, AEA contracts the variance of the agent’s utility posterior, reducing role ambiguity and forcing actions to align with structural requirements. Case studies can be found in Appendix [H](https://arxiv.org/html/2605.24197#A8) and [I](https://arxiv.org/html/2605.24197#A9).

Weak-to-Strong breaks the Dominant Prior. A key finding is the distinct behavior of Weak-to-Strong generalization compared to Self-Reflection. As shown in Figure [3(a)](https://arxiv.org/html/2605.24197#S5.F3.sf1), the rating distribution for Self-Reflection is heavily skewed towards high scores (4 and 5), even when the system fails. This empirically validates the Dominant Prior assumption (Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2)). Because the self-reflecting agent shares the same pre-trained prior as the acting agent, it tends to “rationalize” decisive errors rather than correcting them, leading to hallucinated compliance. In contrast, AEA (via Weak-to-Strong) produces a more discriminative distribution of ratings, effectively detecting and mitigating misalignments that the base model ignores.

AEA enables robust error correction. We further analyze the impact of AEA in Figure [3(b)](https://arxiv.org/html/2605.24197#S5.F3.sf2) by decomposing the results into three types: (1) Fixed (Incorrect $\to$ Correct), where the alignment successfully repairs a failure; (2) Preserved (Correct $\to$ Correct), where the correct solution is maintained; and (3) Regressed (Correct $\to$ Incorrect), where the alignment breaks a previously correct solution. The results reveal that Self-Reflection suffers from a much higher regression rate, frequently overriding correct reasoning with generic hallucinations. Conversely, AEA (Weak-to-Strong) significantly reduces regression and achieves a higher frequency of Fixed outcomes. This shows that orthogonal evidence from a specialized model is necessary to effectively correct decisive errors without compromising workflow stability.
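The three-way decomposition above amounts to comparing per-task correctness before and after alignment. A minimal sketch, assuming each task is summarized as a hypothetical `(correct_before, correct_after)` pair of booleans; the fourth label, `unresolved` (Incorrect to Incorrect), is added here only so that the rates sum to one and is not one of the types named in the text.

```python
from collections import Counter

LABELS = ("fixed", "preserved", "regressed", "unresolved")


def classify_outcome(correct_before: bool, correct_after: bool) -> str:
    """Map a (before, after) correctness pair to an outcome label."""
    if not correct_before and correct_after:
        return "fixed"       # Incorrect -> Correct
    if correct_before and correct_after:
        return "preserved"   # Correct -> Correct
    if correct_before and not correct_after:
        return "regressed"   # Correct -> Incorrect
    return "unresolved"      # Incorrect -> Incorrect (label assumed for this sketch)


def outcome_rates(pairs):
    """Return the fraction of tasks that fall into each outcome category."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (before, after) pair is required")
    counts = Counter(classify_outcome(before, after) for before, after in pairs)
    return {label: counts[label] / len(pairs) for label in LABELS}
```

Comparing `outcome_rates` for two alignment methods on the same task set gives the kind of side-by-side view described for Self-Reflection and Weak-to-Strong.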

### 5.5 Automated Failure Attribution

The efficacy of AEA rests on its ability to accurately identify which agent and which step in a workflow is failing. Table [2](https://arxiv.org/html/2605.24197#S5.T2) validates this mechanism on the Who&When benchmark. In the “All at Once” setting, general-purpose models (even Llama-3.1-70B and GPT-4o) struggle significantly, indicating a lack of sensitivity to causal links in multi-agent failures. Conversely, our specialized AEA-4B model achieves the best accuracy (Step Accuracy of 32.70% and Agent Accuracy of 60.79%) in identifying agent failures, outperforming general-purpose models that are significantly larger. Our method also shows competitive performance in the “Step by Step” setting. This result empirically confirms the Information Gain Condition (Corollary [4.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2)), showing that specialized training enables AEA to extract the informative evidence required for agentic alignment.

![Refer to caption](https://arxiv.org/html/2605.24197v1/x3.png) (a) Rating and Step distributions. Self-Reflection skews towards higher ratings, indicating a failure to detect errors due to shared priors.
![Refer to caption](https://arxiv.org/html/2605.24197v1/x4.png) (b) Error Correction Rate. Outcomes are Fixed (Incorrect $\to$ Correct), Preserved (Correct $\to$ Correct), and Regressed (Correct $\to$ Incorrect).

Figure 3: Comparison between Self-Reflection and Weak-to-Strong Generalization. (a) Rating and Step distributions reveal that Self-Reflection rationalizes errors with high ratings due to the Dominant Prior. (b) Weak-to-Strong AEA significantly reduces regressions and increases the rate of fixed errors.

### 5.6 Empirical Validations on Theoretical Results

To test the effectiveness of the evidence, we use the decisive-error reduction $\Delta P_{e}=P_{e}(Y_{t})-P_{e}(Y_{t},E_{t})$ as a behavioral surrogate for $I(L^{*};E_{t}\mid Y_{t})$ (monotone via Theorem [4.1](https://arxiv.org/html/2605.24197#S4.Thmtheorem1)) and trace it along an evidence gradient on GPT-4o: Naive Retry (no new evidence), Generic Feedback (role-agnostic message), Self-Reflection (same-prior evidence), and AEA-4B (external evidence model). As shown in Table [3](https://arxiv.org/html/2605.24197#S5.T3), Naive Retry ($0.002$), Generic Feedback ($0.013$), and Self-Reflection ($0.005$) remain near zero, with Self-Reflection even regressing on DataBench ($-3.0$ pp) and Chemistry ($-4.8$ pp), consistent with the Dominant Prior pulling same-prior reflection toward rationalization. AEA-4B reaches $\Delta P_{e}=0.071$, more than five times the best alternative (Generic Feedback, $0.013$) and over an order of magnitude above Naive Retry and Self-Reflection, confirming that the Information Gain Condition is the operational mechanism in practice: the gain from AEA is driven by genuinely informative evidence rather than extra compute or generic prompting. The same gradient also yields a practical diagnostic: when AEA repairs a failed trajectory, the bottleneck is missing evidence. When it does not, the bottleneck is more likely missing capability, indicating whether to invest in stronger base models or in richer evidence. We discuss the broader efficiency-reliability trade-off and the relation to concurrent works in Appendix [F](https://arxiv.org/html/2605.24197#A6). Further empirical validation of the theoretical results is provided in Appendix [D](https://arxiv.org/html/2605.24197#A4).
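The surrogate $\Delta P_{e}$ is a difference of two empirical error rates, so its sign convention is easy to make concrete. The sketch below assumes each run is recorded as a hypothetical boolean flag (`True` if the trajectory ended in a decisive error); the function names are illustrative and not taken from the paper.

```python
def decisive_error_rate(error_flags):
    """Empirical P_e: the fraction of trajectories ending in a decisive error."""
    flags = list(error_flags)
    if not flags:
        raise ValueError("at least one trajectory is required")
    return sum(bool(flag) for flag in flags) / len(flags)


def delta_pe(errors_baseline, errors_with_evidence):
    """Delta P_e = P_e(Y_t) - P_e(Y_t, E_t).

    Positive values mean the added evidence reduced decisive errors;
    negative values correspond to a regression.
    """
    return decisive_error_rate(errors_baseline) - decisive_error_rate(errors_with_evidence)


def mean_delta_pe(per_benchmark):
    """Average Delta P_e over (baseline, with_evidence) pairs, one per benchmark."""
    deltas = [delta_pe(base, treated) for base, treated in per_benchmark]
    if not deltas:
        raise ValueError("at least one benchmark is required")
    return sum(deltas) / len(deltas)
```

For instance, a benchmark with 30 decisive errors in 100 baseline runs and 23 in 100 evidence-augmented runs gives a $\Delta P_{e}$ of about 0.07; these counts are made up for illustration.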

Table 3: Evidence gradient on GPT-4o. $\Delta P_{e}$ is the average decisive-error reduction across the four benchmarks. Compute alone or generic feedback moves $\Delta P_{e}$ only marginally; AEA-4B sits at least five times higher, confirming the Information Gain Condition (Corollary [4.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2)).

## 6 Conclusion

In this work, we present a study of agentic misalignment in automated workflows, characterizing it as a systematic failure of Bayesian inference over latent roles in which agents’ posteriors collapse to generic proxy utilities. To address this, we introduced Agentic Evidence Attribution (AEA), a novel alignment framework for incorporating structured, role-specific evidence into the agent’s decision process. Our empirical evaluation highlights two main conclusions. First, evidence-conditioned alignment is a powerful lever for improving multi-agent reliability, often yielding performance gains that test-time scaling cannot achieve. Second, the success of Weak-to-Strong generalization shows that small, specialized evidence models can effectively provide the orthogonal alignment signals needed to improve powerful automated workflows. This opens a promising avenue for scalable oversight in multi-agent systems, suggesting that we can build reliable automated workflows by coupling strong reasoning agents with specialized, evidence-focused aligners. We hope that this research can facilitate future strides toward trustworthy deployment of autonomous agentic systems.

## References

- Achiam et al\. \[2023\]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*, 2023\.
- Aksitov et al\. \[2024\]Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al\.Rest meets react: Self\-improvement for multi\-step reasoning llm agent\.In*ICLR 2024 Workshop on Large Language Model \(LLM\) Agents*, 2024\.
- Anthropic \[2024\]Anthropic\.Introducing the next generation of claude, 2024\.URL[https://www\.anthropic\.com/news/claude\-3\-family](https://www.anthropic.com/news/claude-3-family)\.
- Bengio et al\. \[2025\]Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al\.Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path?*arXiv preprint arXiv:2502\.15657*, 2025\.
- Brown et al\. \[2024\]Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini\.Large language monkeys: Scaling inference compute with repeated sampling\.*arXiv preprint arXiv:2407\.21787*, 2024\.
- Cemri et al\. \[2025\]Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al\.Why do multi\-agent llm systems fail?*arXiv preprint arXiv:2503\.13657*, 2025\.
- Chen et al\. \[2021\]Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phil Tillet, Felipe Petroski Such, David W\. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert\-Voss, William H\. Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M\. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba\.Evaluating large language models trained on code\.*ArXiv*, abs/2107\.03374, 2021\.
- Chen et al\. \[2025\]Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al\.Rm\-r1: Reward modeling as reasoning\.*arXiv preprint arXiv:2505\.02387*, 2025\.
- Erdogan et al\. \[2025\]Lutfi Eren Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami\.Plan\-and\-act: Improving planning of agents for long\-horizon tasks\.In*Forty\-second International Conference on Machine Learning*, 2025\.
- Grijalba et al\. \[2024\]Jorge Osés Grijalba, L Alfonso Urena Lopez, Eugenio Martínez\-Cámara, and Jose Camacho\-Collados\.Question answering over tabular data with databench: A large\-scale empirical evaluation of llms\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)*, pages 13471–13488, 2024\.
- Guo et al\. \[2024\]Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang\.Large language model based multi\-agents: A survey of progress and challenges\.In*IJCAI*, 2024\.
- Han et al\. \[2024\]Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu\.Llm multi\-agent systems: Challenges and open problems\.*arXiv preprint arXiv:2402\.03578*, 2024\.
- HuggingFace \[2025a\]HuggingFace\.Aime 2024 benchmark \(huggingfaceh4/aime\_2024\)\.[https://huggingface\.co/datasets/HuggingFaceH4/aime\_2024](https://huggingface.co/datasets/HuggingFaceH4/aime_2024), 2025a\.
- HuggingFace \[2025b\]HuggingFace\.Aime 2025 benchmark \(opencompass/aime2025\)\.[https://huggingface\.co/datasets/opencompass/AIME2025](https://huggingface.co/datasets/opencompass/AIME2025), 2025b\.
- Karwowski et al\. \[2024\]Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse\.Goodhart’s law in reinforcement learning\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Kim et al\. \[2025\]Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al\.Towards a science of scaling agent systems\.*arXiv preprint arXiv:2512\.08296*, 2025\.
- Korbak et al\. \[2022\]Tomasz Korbak, Ethan Perez, and Christopher Buckley\.Rl with kl penalties is better viewed as bayesian inference\.In*Findings of the Association for Computational Linguistics: EMNLP 2022*, pages 1083–1091, 2022\.
- Lambert et al\. \[2025\]Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al\.Rewardbench: Evaluating reward models for language modeling\.In*Findings of the Association for Computational Linguistics: NAACL 2025*, pages 1755–1797, 2025\.
- Li et al\. \[2024\]Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang\.Autoflow: Automated workflow generation for large language model agents\.*arXiv preprint arXiv:2407\.12821*, 2024\.
- Liu et al\. \[2025\]Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li\.Rm\-bench: Benchmarking reward models of language models with subtlety and style\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Lynch et al\. \[2025\]Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy\.Agentic misalignment: How llms could be insider threats\.*arXiv preprint arXiv:2510\.05179*, 2025\.
- Ma et al\. \[2025\]Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, and Qing Wang\.Diagnosing failure root causes in platform\-orchestrated agentic systems: Dataset, taxonomy, and benchmark\.*arXiv preprint arXiv:2509\.23735*, 2025\.
- MacKay \[2002\]David J\. C\. MacKay\.*Information Theory, Inference & Learning Algorithms*\.Cambridge University Press, USA, 2002\.ISBN 0521642981\.
- Mialon et al\. \[2023\]Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom\.Gaia: a benchmark for general ai assistants\.In*The Twelfth International Conference on Learning Representations*, 2023\.
- Plaat et al\. \[2025\]Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, and Niki Van Stein\.Multi\-step reasoning with large language models, a survey\.*ACM Computing Surveys*, 2025\.
- Poddar et al\. \[2024\]Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques\.Personalizing reinforcement learning from human feedback with variational preference learning\.In*Proceedings of the 38th International Conference on Neural Information Processing Systems*, pages 52516–52544, 2024\.
- Prasad et al\. \[2024\]Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot\.Adapt: As\-needed decomposition and planning with language models\.In*Findings of the Association for Computational Linguistics: NAACL 2024*, pages 4226–4252, 2024\.
- Qin et al\. \[2024\]Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al\.Toolllm: Facilitating large language models to master 16000\+ real\-world apis\.In*ICLR*, 2024\.
- Rădulescu et al\. \[2020\]Roxana Rădulescu, Patrick Mannion, Diederik M Roijers, and Ann Nowé\.Multi\-objective multi\-agent decision making: a utility\-based analysis and survey\.*Autonomous Agents and Multi\-Agent Systems*, 34\(1\):10, 2020\.
- Shao et al\. \[2024\]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\.
- Skalse et al\. \[2022\]Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger\.Defining and characterizing reward gaming\.*Advances in Neural Information Processing Systems*, 35:9460–9471, 2022\.
- Snell et al\. \[2024\]Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\.Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.*arXiv preprint arXiv:2408\.03314*, 2024\.
- Sohl\-Dickstein \[2023\]Jascha Sohl\-Dickstein\.The hot mess theory of AI misalignment: More intelligent agents behave less coherently \.[https://sohl\-dickstein\.github\.io/2023/03/09/coherence\.html](https://sohl-dickstein.github.io/2023/03/09/coherence.html), 2023\.
- Spaan \[2012\]Matthijs TJ Spaan\.Partially observable markov decision processes\.In*Reinforcement learning: State\-of\-the\-art*, pages 387–414\. Springer, 2012\.
- Wang et al\. \[2025\]Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, et al\.Persona features control emergent misalignment\.*arXiv preprint arXiv:2506\.19823*, 2025\.
- Wang et al\. \[2024\]Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang\.Scibench: Evaluating college\-level scientific problem\-solving abilities of large language models\.In*International Conference on Machine Learning*, pages 50622–50649\. PMLR, 2024\.
- Weng \[2024\]Lilian Weng\.Reward hacking in reinforcement learning\.*lilianweng\.github\.io*, Nov 2024\.URL[https://lilianweng\.github\.io/posts/2024\-11\-28\-reward\-hacking/](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/)\.
- White et al\. \[2025\]Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz\-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al\.Livebench: A challenging, contamination\-limited llm benchmark\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Wu et al\. \[2024\]Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al\.Autogen: Enabling next\-gen llm applications via multi\-agent conversations\.In*First Conference on Language Modeling*, 2024\.
- Wu et al\. \[2025\]Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, et al\.Optimas: Optimizing compound ai systems with globally aligned local rewards\.*arXiv preprint arXiv:2507\.03041*, 2025\.
- Ye et al\. \[2025\]Wenqian Ye, Guangtao Zheng, and Aidong Zhang\.Rectifying shortcut behaviors in preference\-based reward learning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.
- Yoran et al\. \[2024\]Ori Yoran, Samuel Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant\.Assistantbench: Can web agents solve realistic and time\-consuming tasks?In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 8938–8968, 2024\.
- Yuan et al\. \[2025\]Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang\.Easytool: Enhancing llm\-based agents with concise tool instruction\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, 2025\.
- Zhang et al\. \[2025a\]Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan\.Agentracer: Who is inducing failure in the llm agentic systems?*arXiv preprint arXiv:2509\.03312*, 2025a\.
- Zhang et al\. \[2025b\]Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong\-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al\.Aflow: Automating agentic workflow generation\.In*The Thirteenth International Conference on Learning Representations*, 2025b\.
- Zhang et al\. \[2025c\]Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu\.Which agent causes task failures and when? on automated failure attribution of LLM multi\-agent systems\.In*Forty\-second International Conference on Machine Learning*, 2025c\.

## Appendix

## Appendix A Broader Impact

This paper presents work whose goal is to advance the field of machine learning, specifically focusing on the safety and reliability of automated multi-agent systems (MAS). By formally characterizing agentic misalignment and introducing Agentic Evidence Attribution (AEA) to mitigate it, our research aims to reduce the risks of uncoordinated behavior and reward hacking in increasingly autonomous workflows. As MAS are deployed in critical domains such as scientific research, software engineering, and decision support, ensuring that agents adhere to specific role-based constraints is essential for preventing unintended harmful consequences. While techniques that improve the controllability of agents could theoretically be leveraged to optimize malicious workflows, we believe that understanding the mechanisms of posterior collapse and providing tools for evidence-based alignment are fundamental prerequisites for building robust, trustworthy, and human-aligned AI systems.

## Appendix B Limitations

AEA focuses on role misspecification in automated multi-agent workflows, which is one concrete form of agentic misalignment. Other failures, such as insufficient base-model capability, incorrect workflow decomposition, tool errors, or adversarial misuse, require different mechanisms and are outside the scope of this work. Our Bayesian formulation is intended to characterize observable coordination behavior, rather than to make claims about the internal representations of LLMs. The $\epsilon$-closeness assumption in Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2) is empirically supported in the homogeneous workflow setting studied in this paper (Appendix [D.1](https://arxiv.org/html/2605.24197#A4.SS1)), while heterogeneous workflows with different model families may induce different role-inference dynamics. AEA also adds a modest inference cost, about 6% additional tokens in our experiments (Appendix [D.3](https://arxiv.org/html/2605.24197#A4.SS3)), and is expected to be most useful when errors are caused by role coordination rather than insufficient reasoning depth. Finally, the effectiveness of AEA depends on the extraction function $\mathcal{F}$ producing evidence that satisfies the Information Gain Condition. When the extracted evidence is not informative for the current workflow, AEA should be viewed as a diagnostic and corrective framework rather than a guaranteed solution to all agentic failures.

## Appendix C Proofs of Theoretical Results

In this section, we provide full proofs for the theorems and corollaries presented in Sections [3](https://arxiv.org/html/2605.24197#S3) and [4](https://arxiv.org/html/2605.24197#S4) of the main paper.

### C.1 Proof of Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2)

Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2) ($\epsilon$-Stability of Role Posteriors). Let $i,j\in\mathcal{N}$ be two distinct roles. Assume the priors are $\epsilon_{\pi}$-close: $\|P(\cdot\mid i)-P(\cdot\mid j)\|_{\mathrm{TV}}\leq\epsilon_{\pi}$ and the likelihoods are $\epsilon_{\ell}$-close. Furthermore, assume the workflow evidence $Y$ is sufficiently informative such that the marginal likelihood is lower-bounded by $\zeta>0$ for both roles. Then, the posterior distributions are $\delta$-close, where $\delta=\kappa(2\epsilon_{\pi}+\epsilon_{\ell})$ and $\kappa\propto\zeta^{-1}$. Consequently, the probability that agents $i$ and $j$ select distinct actions is upper-bounded:

$$P(\hat{\pi}_{i}(Y)\neq\hat{\pi}_{j}(Y))\leq\delta. \tag{9}$$

If the oracle requirements differ ($\pi^{*}_{i}\neq\pi^{*}_{j}$), the probability of misalignment is lower-bounded by $1-\delta$.

###### Proof\.

We derive the bound on the posterior distance. Let $P_{k}(\theta)$ and $L_{k}(\theta)$ denote the prior and likelihood for agent $k\in\{i,j\}$. The posterior is given by Bayes’ rule: $P(\theta|Y,k)=L_{k}(\theta)P_{k}(\theta)/Z_{k}$, where $Z_{k}=\int L_{k}(\theta)\,dP_{k}(\theta)$ is the evidence.

We first bound the difference in evidence $|Z_{i}-Z_{j}|$. Using the linearity of the integral and the triangle inequality:

$$|Z_{i}-Z_{j}|\leq\left|\int L_{i}\,d(P_{i}-P_{j})\right|+\left|\int(L_{i}-L_{j})\,dP_{j}\right|. \tag{10}$$

Assuming likelihoods are bounded by $M$, the first term is bounded by $2M\epsilon_{\pi}$ and the second by $\epsilon_{\ell}$. Thus, the evidences are close.

Next, we apply the perturbation bound for the ratio $\frac{A}{a}-\frac{B}{b}=\frac{A-B}{a}+B\frac{b-a}{ab}$. Letting $A=L_{i}P_{i}$ and $a=Z_{i}$ (and correspondingly $B=L_{j}P_{j}$ and $b=Z_{j}$), we obtain:

$$\|P(\cdot|i)-P(\cdot|j)\|_{\mathrm{TV}}\leq\frac{1}{Z_{i}}\|L_{i}P_{i}-L_{j}P_{j}\|_{\mathrm{TV}}+\frac{|Z_{j}-Z_{i}|}{Z_{i}Z_{j}}. \tag{11}$$

Given the lower bound assumption $Z_{k}\geq\zeta$, the inverse terms are bounded by $\zeta^{-1}$. Grouping the error terms $\epsilon_{\pi}$ and $\epsilon_{\ell}$ yields a linear dependency scaled by the condition number $\kappa\approx\frac{M}{\zeta}$. Thus, strong generic priors (small $\epsilon_{\pi}$) and weak evidence (small $\zeta$) ensure posteriors remain $\delta$-close, leading to generic mode collapse. Hence, by the definition of TV distance, the probability that agents $i$ and $j$ select distinct actions is also upper-bounded by $\delta$. ∎

### C.2 Proof of Theorem [4.1](https://arxiv.org/html/2605.24197#S4.Thmtheorem1)

Theorem [4.1](https://arxiv.org/html/2605.24197#S4.Thmtheorem1) (Lower Bound on Misalignment Error). Assume the action space has cardinality $|\mathcal{A}_{i}|=M$. Let $\hat{a}(Y)$ be any decision rule based on evidence $Y$. The probability of decisive error is lower-bounded by Fano’s Inequality:

$$P_{e}\geq 1-\frac{I(L^{*};Y)+\log 2}{\log M}, \tag{12}$$

where $I(L^{*};Y)$ is the mutual information between the true optimal action label and the evidence.

###### Proof\.

We model the agent’s decision process as a Markov chain $L^{*}\to Y\to\hat{L}$, where $L^{*}$ is the true optimal action and $\hat{L}$ is the agent’s estimated action. Fano’s Inequality relates the probability of error $P_{e}=P(\hat{L}\neq L^{*})$ to the conditional entropy $H(L^{*}\mid Y)$:

$$H(L^{*}\mid Y)\leq H_{b}(P_{e})+P_{e}\log(M-1), \tag{13}$$

where $H_{b}(P_{e})$ is the binary entropy function. We apply two bounds: (1) $H_{b}(P_{e})\leq\log 2$ (the maximum entropy of a binary variable), and (2) $\log(M-1) < \log M$. Substituting these yields:

$$H(L^{*}\mid Y)\leq\log 2+P_{e}\log M. \tag{14}$$

By the definition of mutual information, $H(L^{*}\mid Y)=H(L^{*})-I(L^{*};Y)$. Assuming a uniform prior over the optimal actions (representing the worst-case uncertainty), $H(L^{*})=\log M$. We substitute this into the inequality:

$$\log M-I(L^{*};Y)\leq\log 2+P_{e}\log M. \tag{15}$$

Rearranging terms to solve for $P_{e}$:

$$\begin{aligned} P_{e}\log M &\geq \log M-I(L^{*};Y)-\log 2 &&\text{(16)}\\ P_{e} &\geq 1-\frac{I(L^{*};Y)+\log 2}{\log M}. &&\text{(17)} \end{aligned}$$

This completes the proof. ∎
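To get a feel for the bound in Eq. (12), it can be evaluated numerically. The sketch below is a direct transcription of the formula, assuming mutual information is measured in nats (natural logarithms, matching $\log 2$ and $\log M$) and truncating at zero, since the bound is vacuous once it goes negative; the function name is illustrative.

```python
import math


def fano_lower_bound(mutual_information_nats: float, num_actions: int) -> float:
    """Lower bound on P_e from Eq. (12): 1 - (I + log 2) / log M."""
    if num_actions < 2:
        raise ValueError("the action space needs at least two actions")
    if mutual_information_nats < 0:
        raise ValueError("mutual information is non-negative")
    bound = 1.0 - (mutual_information_nats + math.log(2)) / math.log(num_actions)
    # A negative value carries no information about P_e, so report 0 instead.
    return max(0.0, bound)
```

With $M=16$ actions and evidence carrying no information about $L^{*}$, the bound is about 0.75; evidence worth $\log 4$ nats (two bits) lowers it to about 0.25, and evidence worth $\log 16$ nats makes it vacuous.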

### C.3 Proof of Corollary [4.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2)

Corollary [4.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2) (Information Gain Condition). Let $P_{e}(Y)$ and $P_{e}(Y^{\prime})$ be the probabilities of decisive error under baseline and AEA evidence, respectively. A necessary condition for AEA to strictly reduce the probability of misalignment (i.e., $P_{e}(Y^{\prime}) < P_{e}(Y)$) is

$$I(L^{*};E_{t}\mid Y_{t})>0 \tag{18}$$
###### Proof\.

We compare the mutual information terms in the lower bound from Theorem [4.1](https://arxiv.org/html/2605.24197#S4.Thmtheorem1). The updated evidence is $Y^{\prime}=(Y,E)$. By the chain rule for mutual information:

$$I(L^{*};Y,E)=I(L^{*};Y)+I(L^{*};E\mid Y).\tag{19}$$

For the error bound to decrease, the subtracted term in Fano’s inequality must increase:

$$\frac{I(L^{*};Y^{\prime})+\log 2}{\log M}>\frac{I(L^{*};Y)+\log 2}{\log M}.\tag{20}$$

This simplifies to $I(L^{*};Y^{\prime})>I(L^{*};Y)$. Substituting the chain rule expansion, we require:

$$I(L^{*};Y)+I(L^{*};E\mid Y)>I(L^{*};Y)\implies I(L^{*};E\mid Y)>0.\tag{21}$$

Thus, the agentic evidence $E$ must provide strictly positive information about the optimal action $L^{*}$ conditioned on the existing workflow $Y$. ∎
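The condition $I(L^{*};E\mid Y)>0$ can be checked on a toy distribution. The sketch below is an illustration under stated assumptions: $L^{*}$ is a fair bit, and $Y$ and $E$ are conditionally independent noisy copies of it with hypothetical flip probabilities. All function names are made up for this example. It computes the three terms of Eq. (19) from entropies.

```python
import math
from itertools import product


def noisy_copy_joint(flip_y: float, flip_e: float) -> dict:
    """Joint pmf p[(l, y, e)]: L is a fair bit; Y and E are noisy copies of L."""
    joint = {}
    for l, y, e in product((0, 1), repeat=3):
        p_y = 1 - flip_y if y == l else flip_y
        p_e = 1 - flip_e if e == l else flip_e
        joint[(l, y, e)] = 0.5 * p_y * p_e
    return joint


def marginal(joint: dict, keep: tuple) -> dict:
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out


def entropy(pmf: dict) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def chain_rule_terms(joint: dict) -> tuple:
    """Return (I(L;Y), I(L;E|Y), I(L;Y,E)) in nats; indices are 0=L, 1=Y, 2=E."""
    def h(keep):
        return entropy(marginal(joint, keep))

    i_ly = h((0,)) + h((1,)) - h((0, 1))
    i_le_given_y = h((0, 1)) + h((1, 2)) - h((1,)) - h((0, 1, 2))
    i_l_ye = h((0,)) + h((1, 2)) - h((0, 1, 2))
    return i_ly, i_le_given_y, i_l_ye


# Informative evidence (10% flips) versus pure noise (50% flips).
for flip_e in (0.1, 0.5):
    i_ly, i_cond, i_joint = chain_rule_terms(noisy_copy_joint(0.3, flip_e))
    print(f"flip_e={flip_e}: I(L;Y)={i_ly:.4f}, I(L;E|Y)={i_cond:.4f}, I(L;Y,E)={i_joint:.4f}")
```

When the evidence channel is pure noise, the conditional term vanishes and the Fano bound stays where it was, which matches the role the corollary assigns to Eq. (18).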

### C\.4Proof of Theorem[4\.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3)

Theorem [4.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3) (Covariance Contraction under Verifiable Evidence). Under the linear-Gaussian assumption, the posterior covariance $\Sigma_{Y^{\prime}}$ after observing AEA evidence $E_{t}$ satisfies the Loewner partial order $\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$. Specifically, the precision matrix increases by the Fisher information of the evidence:

$$\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}.\tag{22}$$
###### Proof\.

We rely on the standard Bayesian Linear Regression update equations for the precision matrix (inverse covariance). Let $\Lambda_{Y}=\Sigma_{Y}^{-1}$ be the precision of the belief given baseline evidence. The likelihood of the new evidence $E_{t}$ is Gaussian with covariance $\mathbf{R}$. The posterior precision $\Lambda_{Y^{\prime}}$ is given by the sum of the prior precision and the observation precision (Fisher information):

$$\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}.\tag{23}$$

To prove contraction ($\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$), we must show that $\Sigma_{Y}-\Sigma_{Y^{\prime}}$ is positive semi-definite (PSD). Equivalently, using the property that if $A\succeq B\succ 0$ then $B^{-1}\succeq A^{-1}$, we show $\Sigma_{Y^{\prime}}^{-1}\succeq\Sigma_{Y}^{-1}$. Consider the term $\mathbf{Q}=\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}$. Since $\mathbf{R}$ is a valid covariance matrix, it is symmetric positive definite ($\mathbf{R}\succ 0$), implying $\mathbf{R}^{-1}\succ 0$. For any non-zero vector $\mathbf{v}\in\mathbb{R}^{d}$:

$$\mathbf{v}^{\top}\mathbf{Q}\mathbf{v}=\mathbf{v}^{\top}\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}\mathbf{v}=(\mathbf{H}\mathbf{v})^{\top}\mathbf{R}^{-1}(\mathbf{H}\mathbf{v}).\tag{24}$$

Let $\mathbf{z}=\mathbf{H}\mathbf{v}$. Then $\mathbf{z}^{\top}\mathbf{R}^{-1}\mathbf{z}\geq 0$ because $\mathbf{R}^{-1}$ is positive definite. Thus, $\mathbf{Q}\succeq 0$. Since $\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{Q}$ and $\mathbf{Q}\succeq 0$, it follows that $\Sigma_{Y^{\prime}}^{-1}\succeq\Sigma_{Y}^{-1}$. By inversion monotonicity, $\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$. ∎
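The contraction can be verified numerically in a small case. The sketch below assumes a two-dimensional utility vector and a single scalar observation, so that $\mathbf{H}$ is a row vector `h` and $\mathbf{R}$ is a noise variance `r`. These shapes, the helper names, and the numbers are illustrative assumptions rather than settings from the paper.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix")
    return [[d / det, -b / det], [-c / det, a / det]]


def posterior_covariance(sigma_y, h, r):
    """Eq. (22) for a 2-D utility and one scalar observation with noise variance r."""
    if r <= 0:
        raise ValueError("noise variance r must be positive")
    precision = inv2(sigma_y)
    for i in range(2):
        for j in range(2):
            precision[i][j] += h[i] * h[j] / r  # adds H^T R^{-1} H
    return inv2(precision)


def difference(a, b):
    """Entry-wise a - b for 2x2 matrices."""
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


def is_psd_2x2(m, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are non-negative."""
    (a, b), (_, d) = m
    return a + d >= -tol and a * d - b * b >= -tol


sigma_y = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical baseline covariance
sigma_post = posterior_covariance(sigma_y, h=[1.0, 0.0], r=0.5)
print("posterior covariance:", sigma_post)
print("Sigma_Y - Sigma_Y' is PSD:", is_psd_2x2(difference(sigma_y, sigma_post)))
```

With a single observation the difference $\Sigma_{Y}-\Sigma_{Y^{\prime}}$ has rank one, so the posterior tightens only along directions the evidence actually measures.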

### C\.5Proof of Corollary[4\.4](https://arxiv.org/html/2605.24197#S4.Thmtheorem4)

Corollary [4.4](https://arxiv.org/html/2605.24197#S4.Thmtheorem4) (Reliability Improvement via Variance Reduction). Consider two actions $a_{1},a_{2}$ with feature difference $\delta=\phi(s,a_{1})-\phi(s,a_{2})$ under state $s$. If $a_{1}$ is optimal ($\mathbf{w}^{\top}\delta>0$) and the evidence $E_{t}$ is unbiased (does not decrease the margin mean $\mu$), then $P_{e}(Y^{\prime})\leq P_{e}(Y)$.

###### Proof\.

Let the random variable $D=\mathbf{w}^{\top}\delta$ represent the projected utility difference. Under the posterior $Y$, $D$ follows a univariate Gaussian distribution:

$$D\mid Y\sim\mathcal{N}(\mu_{D},\sigma_{D}^{2}),\quad\text{where }\mu_{D}=\boldsymbol{\mu}_{Y}^{\top}\delta,\;\sigma_{D}^{2}=\delta^{\top}\Sigma_{Y}\delta.\tag{25}$$

The agent commits an error if it samples a utility vector $\mathbf{w}$ such that $a_{2}$ appears better than $a_{1}$, i.e., $D<0$. The probability of this event is given by the Cumulative Distribution Function (CDF) of the standard normal distribution, $\Phi$:

$$P_{e}(Y)=P(D<0\mid Y)=\Phi\left(-\frac{\mu_{D}}{\sigma_{D}}\right).\tag{26}$$

From Theorem [4.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3), we know $\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$. By the definition of PSD matrices, this implies variance reduction along any direction $\delta$:

$$\sigma_{D^{\prime}}^{2}=\delta^{\top}\Sigma_{Y^{\prime}}\delta\leq\delta^{\top}\Sigma_{Y}\delta=\sigma_{D}^{2}.\tag{27}$$

Hence $\sigma_{D^{\prime}}\leq\sigma_{D}$. We assume unbiased evidence such that the posterior mean margin does not degrade: $\mu_{D^{\prime}}\geq\mu_{D}>0$. We analyze the argument of the CDF $\Phi(x)$. Since $\mu_{D^{\prime}}\geq\mu_{D}>0$ and $0<\sigma_{D^{\prime}}\leq\sigma_{D}$:

$$\frac{\mu_{D^{\prime}}}{\sigma_{D^{\prime}}}\geq\frac{\mu_{D}}{\sigma_{D}}\implies-\frac{\mu_{D^{\prime}}}{\sigma_{D^{\prime}}}\leq-\frac{\mu_{D}}{\sigma_{D}}.\tag{28}$$

Since $\Phi(z)$ is a strictly monotonically increasing function, a smaller (more negative) argument yields a smaller probability:

$$\Phi\left(-\frac{\mu_{D^{\prime}}}{\sigma_{D^{\prime}}}\right)\leq\Phi\left(-\frac{\mu_{D}}{\sigma_{D}}\right).\tag{29}$$

Thus, $P_{e}(Y^{\prime})\leq P_{e}(Y)$. ∎
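As a numerical companion to Eq. (26), the sketch below evaluates the error probability before and after the posterior tightens. The margin mean and the two standard deviations are hypothetical values chosen only for illustration, and the standard normal CDF is written via `math.erf`.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF expressed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def error_probability(mu_d: float, sigma_d: float) -> float:
    """Eq. (26): P(D < 0 | Y) for D ~ N(mu_d, sigma_d^2)."""
    if sigma_d <= 0:
        raise ValueError("sigma_d must be positive")
    return normal_cdf(-mu_d / sigma_d)


# Hypothetical margins: same positive mean, smaller spread after evidence.
before = error_probability(mu_d=0.5, sigma_d=1.0)
after = error_probability(mu_d=0.5, sigma_d=0.5)
print(f"P_e(Y) = {before:.4f}, P_e(Y') = {after:.4f}")
```

The ordering depends on the margin mean being positive: with a negative mean, shrinking the variance would push the error probability up rather than down, which is why the corollary requires unbiased evidence.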

## Appendix DQuantitative Analyses of Posterior Collapse and AEA

The Bayesian framework in Section [3](https://arxiv.org/html/2605.24197#S3) produces several behavioral predictions that go beyond final-task accuracy. In this section, we report a series of controlled analyses that probe these predictions on the same multi-agent setup as Section [5.3](https://arxiv.org/html/2605.24197#S5.SS3): how different roles converge to a shared policy, whether the evidence injected by AEA carries new information about the optimal action, how the posterior contracts under AEA, how AEA scales with workflow complexity, and how AEA compares against pure test-time scaling on the compute axis. These analyses characterize *how* AEA changes agent behavior, not only *whether* it improves accuracy.

### D\.1Posterior Collapse

Setup. We test whether unaligned agents actually behave as if they share a single policy across roles, the prediction made by Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2). To make this prediction directly testable, we instantiate a three-role workflow on AIME24 with maximally distinct roles: a Problem Solver that solves the problem from scratch, a Solution Verifier that checks the proposed solution *without re-solving it*, and a Result Reporter that only formats the final answer. We measure two behavioral quantities, both judged by GPT-4o with fixed rubrics. Pairwise Functional Overlap measures, for every ordered pair of agents, how often they perform functionally equivalent operations across 90 comparisons; this serves as a behavioral proxy for $P(\hat{\pi}_{i}=\hat{\pi}_{j})$. Role Action Accuracy measures the fraction of turns whose action matches the assigned role description; a 50-sample human spot-check shows 96% agreement with the judge.

Table 4: Behavioral signature of posterior collapse on AIME24 with three maximally distinct roles. Without alignment, agents converge to a shared generic policy with high functional overlap and low role action accuracy, empirically confirming Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2). AEA breaks this convergence, and Weak-to-Strong evidence is the most effective.

Unaligned agents collapse onto a shared policy. As shown in Table [4](https://arxiv.org/html/2605.24197#A4.T4), even with maximally distinct role specifications, unaligned agents exhibit 58.3% functional overlap while only 34.2% of turns are role-aligned. This is exactly the behavior predicted by Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2): under $\epsilon$-close priors and likelihoods, the dominant pre-training prior pulls all agents toward the same generic policy, regardless of their nominal role. Self-Reflection only mildly attenuates the collapse (47.1% overlap, 42.8% role accuracy), because the reflecting agent inherits the same prior that produced the collapse in the first place. In contrast, Weak-to-Strong AEA halves the overlap to 29.5% and nearly doubles role action accuracy to 59.6%. This empirically confirms that posterior collapse is the operative failure mode, and that breaking it requires a prior-distinct evidence channel rather than additional same-prior reasoning.

### D\.2Posterior Contraction with AEA

Setup. Theorem [4.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3) predicts that AEA tightens the posterior over the agent’s latent utility, $\Sigma_{Y^{\prime}}\preceq\Sigma_{Y}$. As an observable surrogate for this contraction, we re-run each of 30 randomly sampled tasks $k=5$ times under each setting, embed every final answer with a fixed sentence encoder, and measure the per-task variance of the embeddings around their per-task mean.

Table 5: Per-task variance of final-answer embeddings across $30$ tasks $\times$ $5$ repetitions. AEA roughly halves the variance, the behavioral analogue of the closed-form variance reduction $\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}$ in Theorem [4.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3).

Variance reduction matches the closed-form prediction. As shown in Table [5](https://arxiv.org/html/2605.24197#A4.T5), the unaligned workflow yields a per-task variance of 0.0847, indicating that repeated runs of the same task produce systematically different answers, the empirical signature of a broad posterior. Self-Reflection contracts the variance modestly to 0.0691, since same-prior reflection only weakly constrains the posterior. Weak-to-Strong AEA roughly halves the unaligned variance, reaching 0.0423. This is the behavioral analogue of the closed-form variance reduction $\Sigma_{Y^{\prime}}^{-1}=\Sigma_{Y}^{-1}+\mathbf{H}^{\top}\mathbf{R}^{-1}\mathbf{H}$ in Theorem [4.3](https://arxiv.org/html/2605.24197#S4.Thmtheorem3): external verifiable evidence projects the latent utility onto observable constraints, sharpens the posterior, and produces more reproducible answers across runs.
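The variance surrogate described in the setup can be sketched as follows. This is one plausible reading rather than the authors' exact procedure: it assumes each run's final answer has already been embedded into a fixed-length vector, and it defines the per-task variance as the mean squared Euclidean distance of the $k$ embeddings from their per-task mean. The text does not specify the encoder or the normalization, so both are left as assumptions, and the function names are hypothetical.

```python
def per_task_variance(embeddings):
    """Mean squared distance of k run embeddings (lists of floats) from their mean."""
    k = len(embeddings)
    if k == 0:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / k for j in range(dim)]
    squared_distances = [
        sum((row[j] - mean[j]) ** 2 for j in range(dim)) for row in embeddings
    ]
    return sum(squared_distances) / k


def mean_variance_over_tasks(task_embeddings):
    """Average the per-task variance over all sampled tasks."""
    values = [per_task_variance(rows) for rows in task_embeddings]
    return sum(values) / len(values)


# Toy data: two tasks, two runs each, 2-D embeddings (hypothetical values).
toy = [
    [[0.0, 0.0], [2.0, 0.0]],  # runs disagree
    [[1.0, 1.0], [1.0, 1.0]],  # runs agree exactly
]
print(mean_variance_over_tasks(toy))
```

Under this definition a task whose repeated runs produce identical answers contributes zero, so a lower average indicates more reproducible outputs.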

### D\.3AEA Outperforms Test\-Time Scaling at Lower Cost

Setup. A natural alternative to AEA is to spend the same additional compute on test-time scaling, such as majority voting or best-of-$K$ sampling \[[5](https://arxiv.org/html/2605.24197#bib.bib5), [32](https://arxiv.org/html/2605.24197#bib.bib32)\]. To compare them on a common compute axis, we evaluate AEA against single-agent test-time scaling baselines on AIME24, AIME25, and DataBench using GPT-4o, and report the average total token consumption.

Table 6: Cost-normalized comparison on GPT-4o. AEA adds only $\sim$6% token overhead, yet outperforms a strictly larger Best-of-$K$ budget on two of three benchmarks and matches it on the third.

Compute cannot substitute for evidence. As shown in Table [6](https://arxiv.org/html/2605.24197#A4.T6), AEA adds only $\sim$6% token overhead over the unaligned multi-agent workflow ($\sim$12,100 vs. $\sim$11,400). Even when test-time scaling is given a strictly larger compute budget (Best-of-$K$ at $\sim$14,250 tokens), AEA still outperforms it on AIME24 (40.0 vs. 30.0) and DataBench (35.0 vs. 33.0), and matches it on AIME25. The gain from AEA cannot be replicated by raw test-time compute. The bottleneck in these workflows is role coordination rather than reasoning depth, and pumping more samples through the same pre-training prior simply produces more confident misaligned answers, not aligned ones. This empirically supports the central claim that evidence-conditioned alignment is a more efficient lever than naive scaling.
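The overhead figures follow directly from the approximate token counts quoted above. The short sketch below reproduces the arithmetic; the constant and function names are chosen for illustration, and the inputs are the rounded values from the text, so the percentages are approximate as well.

```python
def relative_overhead(tokens: float, baseline_tokens: float) -> float:
    """Fractional increase in token usage relative to a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return tokens / baseline_tokens - 1.0


# Approximate average token counts quoted in the text.
UNALIGNED_TOKENS = 11_400
AEA_TOKENS = 12_100
BEST_OF_K_TOKENS = 14_250

print(f"AEA overhead:       {relative_overhead(AEA_TOKENS, UNALIGNED_TOKENS):.1%}")
print(f"Best-of-K overhead: {relative_overhead(BEST_OF_K_TOKENS, UNALIGNED_TOKENS):.1%}")
```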

### D\.4AEA Gains Scale with Workflow Complexity

Setup. The Discussion in Appendix [F](https://arxiv.org/html/2605.24197#A6) argues that each additional agent enlarges the misalignment surface area, since each new agent introduces another decision node where the posterior may collapse. We test this prediction by grouping all GPT-4o evaluations from Table [1](https://arxiv.org/html/2605.24197#S4.T1) by the number of agents the orchestrator (CaptainAgent) instantiated for each task, and reporting the unaligned and AEA-aligned accuracy in each bucket.

Table 7: Effect of workflow complexity on GPT-4o. The unaligned baseline degrades as the number of agents grows, while the AEA gain rises monotonically. The largest gains appear in the most complex workflows.

The misalignment surface area widens with workflow size. As shown in Table [7](https://arxiv.org/html/2605.24197#A4.T7), two trends emerge together. First, the unaligned multi-agent accuracy drops monotonically with agent count ($52.3\to 41.5\to 33.3$), confirming that scaling the workflow without alignment is actively harmful. Second, the AEA gain $\Delta$ rises monotonically ($+5.5\to+11.1\to+16.7$). The benefit of AEA therefore concentrates exactly where the unaligned baseline degrades the most. This is the empirical realization of the misalignment-surface-area argument: larger workflows have more decision nodes that can suffer from posterior collapse, and the value of an evidence channel grows accordingly.

## Appendix EStandard Reward Evaluations

Our AEA evidence model is trained to produce role-specific, structured feedback for multi-agent traces. Since this training objective differs from standard reward modeling, we additionally evaluate whether the resulting model still behaves like a strong general-purpose reward reasoning model on widely used reward benchmarks. Concretely, we report results on RewardBench \[[18](https://arxiv.org/html/2605.24197#bib.bib18)\] and RM-Bench \[[20](https://arxiv.org/html/2605.24197#bib.bib20)\], which test preference judgment quality across chat helpfulness, safety, and reasoning, as well as math and code domains. Note that this evaluation is not our main target; it only verifies that the evidence model's preferences remain consistent with human preferences.

Tables [8](https://arxiv.org/html/2605.24197#A5.T8) and [9](https://arxiv.org/html/2605.24197#A5.T9) show that AEA-4B (trained with supervised warm-start followed by GRPO on our agentic trace data) maintains competitive performance compared to its underlying Qwen3 base model and to RM-R1 models \[[8](https://arxiv.org/html/2605.24197#bib.bib8)\] trained directly for reward reasoning. In particular, AEA-4B remains strong on the safety and reasoning categories, indicating that specializing the model for agentic evidence attribution does not collapse its standard preference evaluation ability. Overall, these results support the view that AEA can be used as a plug-in evidence model for workflow alignment without sacrificing the reward evaluation capability that is typically expected from reward reasoning models.

Table 8: Reward Evaluation on Reward Reasoning Model (RewardBench)

Table 9: Reward Evaluation on Reward Reasoning Model (RM-Bench)

## Appendix FDiscussion

The Efficiency-Reliability Trade-off. Current trends in agentic AI emphasize scaling by increasing the number of agents and interaction depth \[[16](https://arxiv.org/html/2605.24197#bib.bib16)\]. However, our results highlight a critical limitation. Scaling without alignment introduces a “curse of dimensionality” in reliability. Each additional agent introduces a new decision node where the generic posterior may fail. This effectively expands the misalignment surface area, defined as the cumulative probability that a local role deviation propagates into a global failure. AEA fundamentally alters this trade-off. It converts minor additional compute into posterior variance reduction. We find that paying this inference cost is far more efficient than naive scaling. Future architectures should prioritize the density of evidence, ensuring $I(L^{*};E\mid Y)>0$.

Capability vs. Evidence. Our Bayesian framework provides a rigorous diagnostic tool for distinguishing between two fundamental failure modes. When AEA intervention leads to success (a Fixed outcome), we can retrospectively attribute the initial failure to missing evidence, where the agent had the capability to solve the task but inferred the wrong utility function due to weak signals. Conversely, failures that persist even after AEA intervention likely represent missing capability, indicating fundamental deficits in reasoning or domain knowledge. This separation clarifies whether resources should be allocated to pre-training stronger base models (to fix capability) or to engineering better runtime context and attribution signals (to fix evidence).

Relation to Concurrent Works. With the rise of LLM-based agents, automated workflow generation has become a common way to build multi-agent systems (MAS). In parallel to our work, recent papers have started to study two directions that are necessary for making these systems reliable: (i) measuring and localizing failures inside a workflow, and (ii) understanding when scaling the number of agents and interaction steps actually helps. First, Zhang et al. \[[46](https://arxiv.org/html/2605.24197#bib.bib46)\] introduce Who&When, a benchmark that targets automatic failure attribution in LLM multi-agent systems by asking models to identify which agent caused the failure and when it happened. Related efforts move from benchmarking to data generation: Zhang et al. \[[44](https://arxiv.org/html/2605.24197#bib.bib44)\] propose an automated pipeline for analyzing failed trajectories and producing structured annotations that can be used for debugging or training attribution models. In a complementary direction, Kim et al. \[[16](https://arxiv.org/html/2605.24197#bib.bib16)\] study the scaling behavior of agent systems and highlight that adding agents or steps is not a free win. Improvements depend on how coordination is organized and where errors accumulate in the workflow. Beyond these, there are also concurrent diagnostics \[[22](https://arxiv.org/html/2605.24197#bib.bib22)\] that analyze root causes in orchestrated agent platforms, which further supports the need for explicit failure localization rather than only scaling compute. Our goal is different from attribution or scaling alone: we argue that both should ultimately serve agentic alignment. Concretely, we provide a bridge between (a) identifying decisive errors inside a trajectory and (b) correcting future decisions by changing the evidence available to the agent. This is why our framework treats attribution outputs as evidence that contracts an agent’s latent utility posterior, rather than as a standalone diagnosis. Empirically, this lets us turn failure attribution into a practical alignment mechanism (AEA), and it also explains why naive self-reflection can fail under a shared prior while weak-to-strong evidence can remain discriminative.

Table 10: Datasets used to generate agentic reasoning traces for training Agentic Evidence Attribution (AEA). The first three benchmarks (GAIA, AssistantBench, LiveBench) are single-agent evaluations that we convert into multi-agent runs, while Who&When is a native multi-agent failure attribution benchmark. We then construct Ours, a larger multi-agent trace dataset with misalignment diagnostics and human-verified annotations.

## Appendix GPrompt Engineering in Evaluation

The following prompts constitute the core interface for our evaluation pipeline\. They are designed to strictly enforce structured outputs \(JSON\) to facilitate automated parsing and downstream feedback injection\.

### G\.1AEA System Prompt

This prompt serves as the inference interface for our fine\-tuned evidence model \(AEA\-4B\)\. It is designed to take the raw, multi\-turn conversation history of a workflow and extract a structured failure attribution\. Unlike standard chat prompts, this system instruction enforces a strict JSON schema that maps directly to the “decisive error” definition: identifying the specific agent role and step number where the trajectory diverged from optimality\.

AEA System Prompt

You are an AI assistant that analyzes multi-agent conversations. Your task is to analyze the conversation and identify issues with agent collaboration.

You MUST respond with a valid JSON object in this exact format: `{"rating": 1-5, "agent_name": "<agent>", "step_number": <int>, "reason": "<text>", "revised_prompt": "<new prompt>"}`

Analysis guidelines: Rate collaboration 1-5 (1=poor, 5=excellent). Identify ONE agent that made the most critical mistake and the step number where it occurred. Note: Computer\_terminal is not an agent. Provide a brief reason and improved system prompt.
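Because downstream feedback injection depends on this schema, a consumer would typically validate the model's reply before using it. The following is a minimal sketch of such a check, not the authors' pipeline: the function name `parse_attribution`, the constant `REQUIRED_FIELDS`, and the exact validation rules are assumptions derived only from the five fields and the guidelines in the prompt above.

```python
import json

# Field names come from the prompt; the type mapping is an assumption.
REQUIRED_FIELDS = {
    "rating": int,
    "agent_name": str,
    "step_number": int,
    "reason": str,
    "revised_prompt": str,
}


def parse_attribution(raw: str) -> dict:
    """Parse and validate one attribution reply; raise ValueError if malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name} must be of type {expected.__name__}")
    if not 1 <= data["rating"] <= 5:
        raise ValueError("rating must be between 1 and 5")
    # The prompt states that the terminal is a tool, not an agent.
    if data["agent_name"].lower() == "computer_terminal":
        raise ValueError("Computer_terminal is not an agent")
    return data


# Hypothetical reply, shaped like the case studies in Appendix I.
reply = (
    '{"rating": 2, "agent_name": "Geometry_Expert", "step_number": 4, '
    '"reason": "Rounded an intermediate value.", "revised_prompt": "Use exact fractions."}'
)
print(parse_attribution(reply)["agent_name"])
```

Rejecting malformed replies early keeps a bad attribution from being injected as evidence into the next round of the workflow.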

### G\.2Self\-Reflection Prompt

To evaluate the “Dominant Prior” hypothesis, we utilize a Self\-Reflection prompt that queries the backbone model \(e\.g\., GPT\-4o, Claude\) to critique its own generated trace\. This prompt mirrors the schema of the AEA prompt to ensure a fair comparison\. Crucially, the model is provided with the full task context—including the problem statement, ground truth \(for experimental analysis\), and the agent’s final answer—and acts as a reviewer to detect the first instance of critical failure\.

Self-Reflection Prompt Template

You are analyzing a multi-agent conversation to identify collaboration issues and provide improvement suggestions.

Task Context

- Problem: {problem}
- Expected Answer: {ground\_truth}
- Agent’s Final Answer: {agent\_answer}

Multi-Agent Conversation

{formatted\_history}

Your Analysis Task

Analyze this multi-agent conversation and identify:

- Rate the overall collaboration quality (1-5)
- Identify the ONE agent that made the most critical mistake
- Find the step number where the mistake first occurred
- Provide a brief reason explaining the issue
- Suggest an improved system prompt for that agent

Note: “computer\_terminal” is a tool, not an agent. Focus on actual agents.

Required Output Format

Respond with a valid JSON object containing: rating, agent\_name, step\_number, reason, revised\_prompt

### G\.3LLM\-as\-a\-Judge Prompt

To standardize performance metrics across our diverse benchmark suite, we employ a strong generalist model (GPT-4o) as an objective judge. This prompt converts the potentially verbose or unstructured outputs of the multi-agent system into a binary correctness label. The prompt is conditioned on the specific domain logic (e.g., checking mathematical derivations for AIME vs. code execution signatures for HumanEval) to ensure high-fidelity evaluation.

LLM Judge Prompt Template

You are evaluating a {solution\_type} from a multi-agent system. Please determine if the final answer is correct.

Problem: {question}

Expected Answer: {expected\_answer}

Multi-Agent Response: {model\_response}

Please evaluate if the multi-agent response correctly solves the problem. Look for:

- AIME: Correct reasoning, final numerical answer, valid approach
- HumanEval: Correct signature, implementation logic, edge cases
- DataBench: Correct answer, data analysis approach, reasoning
- SciBench: Correct scientific reasoning, calculations, principles

Respond with exactly one of: CORRECT (solution is right) or INCORRECT (solution is wrong)
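Turning the judge's reply into a binary label needs some care, because the string `INCORRECT` contains `CORRECT` as a substring. The sketch below shows one strict way to parse the verdict. The function name `parse_verdict` and the decision to reject anything other than a bare label are illustrative assumptions, not a description of the authors' evaluation code.

```python
def parse_verdict(response: str) -> bool:
    """Map a judge reply to True (CORRECT) or False (INCORRECT).

    Compares the whole normalized reply instead of searching for a substring,
    since "CORRECT" also occurs inside "INCORRECT".
    """
    label = response.strip().strip(".").upper()
    if label == "CORRECT":
        return True
    if label == "INCORRECT":
        return False
    raise ValueError(f"unexpected judge reply: {response!r}")


def accuracy(replies) -> float:
    """Fraction of replies labeled CORRECT."""
    verdicts = [parse_verdict(reply) for reply in replies]
    return sum(verdicts) / len(verdicts)


print(accuracy(["CORRECT", "INCORRECT", "correct.", " INCORRECT\n"]))
```

Raising on free-form replies, rather than guessing, makes format drift in the judge visible instead of silently skewing accuracy.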

![Refer to caption](https://arxiv.org/html/2605.24197v1/figures/conversation_graph_0.png) (a) AEA on Claude 3 Haiku (DataBench): AEA successfully flags a decisive error at Step 2 (Red), assigning a low rating (2/5) and preventing misalignment propagation. The red border denotes the ground-truth misaligned step.
![Refer to caption](https://arxiv.org/html/2605.24197v1/figures/conversation_graph_1.png) (b) Self-Reflection on GPT-4o (AIME): The system exhibits Hallucinated Compliance. Despite an issue at Step 1, the shared prior leads to a perfect rating (5/5), failing to catch the error. The red border denotes the ground-truth misaligned step.

Figure 4: Topological comparison of failure modes. (a) AEA acts as a variance reducer, providing orthogonal evidence to identify latent failures. (b) Self-Reflection suffers from posterior collapse, validating incorrect trajectories due to the Dominant Prior.

## Appendix HVisualization on Topological Graphs

To scrutinize the structural dynamics of misalignment, we visualize the workflow execution topology in Figure [4](https://arxiv.org/html/2605.24197#A7.F4). These graphs map the agent interaction network against the temporal conversation flow, allowing us to trace the propagation of decisive errors.

AEA enables precise attribution. Figure [4](https://arxiv.org/html/2605.24197#A7.F4) (Top) displays a DataBench workflow using Claude 3 Haiku aligned with AEA. The system encounters a decisive error at Step 2 by the Dating Relationship Agent. In a standard workflow, this error propagates past the Analysis Checker (Step 3). However, AEA correctly identifies the latent failure, assigning a low alignment rating (2/5) and pinpointing the specific node. This empirically demonstrates that AEA injects the necessary orthogonal evidence (Corollary [4.2](https://arxiv.org/html/2605.24197#S4.Thmtheorem2)) to break the error cascade.

Self-Reflection induces posterior collapse. Conversely, Figure [4](https://arxiv.org/html/2605.24197#A7.F4) (Bottom) illustrates a failure mode on AIME using GPT-4o. Despite an underlying issue at Step 1, the Self-Reflection mechanism assigns a perfect rating (5/5). This visualizes the Dominant Prior phenomenon (Theorem [3.2](https://arxiv.org/html/2605.24197#S3.Thmtheorem2)): because the checking mechanism shares the same prior as the acting agent, it cannot distinguish the error from valid reasoning. The system consequently “rationalizes” the mistake, collapsing into a misaligned state of hallucinated compliance.

## Appendix ICase Studies

We provide case studies for the two instantiations of AEA. Each case is labeled with one of three outcome types: (1) Fixed Error means the MAS answer changes from incorrect to correct, (2) Stayed Wrong means the MAS answer remains incorrect, and (3) Regression means the MAS answer changes from correct to incorrect.
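The three outcome types amount to a simple mapping from correctness before and after alignment. The sketch below is an illustrative helper with hypothetical names. It also returns a fourth label, "Stayed Correct", for the correct-to-correct case, which the text does not name and which is included here only so that the function is total.

```python
def classify_transition(before_correct: bool, after_correct: bool) -> str:
    """Label a case by how the MAS answer changed after alignment feedback."""
    if not before_correct and after_correct:
        return "Fixed Error"
    if not before_correct and not after_correct:
        return "Stayed Wrong"
    if before_correct and not after_correct:
        return "Regression"
    # Not one of the three types named in the text; hypothetical label.
    return "Stayed Correct"


def tally(cases) -> dict:
    """Count outcome labels over (before_correct, after_correct) pairs."""
    counts = {}
    for before, after in cases:
        label = classify_transition(before, after)
        counts[label] = counts.get(label, 0) + 1
    return counts


print(tally([(False, True), (False, False), (True, False), (False, True)]))
```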

### I\.1Weak\-to\-Strong Cases

AEA: Fixed Error (AIME24)

- Transition: incorrect → correct
- Question: Let $\mathcal{B}$ be the set of rectangular boxes with surface area $54$ and volume $23$. Let $r$ be the radius of the smallest sphere that can contain each of the rectangular boxes that are elements of $\mathcal{B}$. The value of $r^{2}$…
- Original Answer: The final answer is 45 (which is p + q, where $r^{2}=41/4$ with p = 41 and q = 4). ✗
- Agent Flagged: Geometry\_Expert at Step 4
- Alignment Feedback: In step 4, the Geometry\_Expert incorrectly states that the dimensions $2\times 2\times 5.75$ give an exact value of $r^{2}=41/4$. However, the exact value is $r^{2}=41.0625/4$, which simplifies to $41/4$ only if $5.75$ is considered an exact fraction.
- After Alignment: Based on the expert conversation, the final answer is: $p+q=721$. This is because $r^{2}=657/64$, where $p=657$ and $q=64$ are relatively prime. ✓

AEA: Fixed Error (AIME25)

- Transition: incorrect → correct
- Question: Six points $A,B,C,D,E$ and $F$ lie in a straight line in that order. Suppose that $G$ is a point not on the line and that $AC=26$, $BD=22$, $CE=31$, $DF=33$, $AF=73$, $CG=40$, and $DG=30$. Find the area of $\triangle BGE$…
- Original Answer: The area of triangle $\triangle BGE$ is 468 square units. ✗
- Agent Flagged: TriangleArea\_Expert at Step 2
- Alignment Feedback: The TriangleArea\_Expert provided a solution that directly calculated the area using coordinates and the formula, but failed to verify whether the coordinates were consistent with the given distances $CG=40$ and $DG=30$.
- After Alignment: The conversation involved calculating the area of triangle $BGE$ based on the coordinates of six points $A,B,C,D,E,F$ arranged in a straight line… ✓

AEA: Fixed Error (HumanEval)

- Transition: incorrect → correct
- Question: `def odd_count(lst):` Given a list of strings, where each string consists of only digits, return a list. Each element i of the output should be “the number of odd elements in the string i of the input.” where all the i’s should be replaced…
- Original Answer: `def odd_count(lst): ...` ✗
- Agent Flagged: Verification\_Expert at Step 6
- Alignment Feedback: The Verification\_Expert correctly identified that the function passed the initial test cases, but did not mention the critical issue that the formatted string has an extra ‘n’ in the phrase ‘the number of odd elements 4n the str4ng 4 of the 4nput.’
- After Alignment: Fixed implementation with correct string formatting. ✓

AEA: Fixed Error (DataBench)

- Transition: incorrect → correct
- Question: DATA ANALYSIS TASK: You are working with the Heart Disease Dataset. Your task is to answer the following question through data analysis. Question: What are the top 3 most common chest pain types? Expected answer type: list\[category\]…
- Original Answer: \['ASY', 'NAP', 'ATA'\] ✗
- Agent Flagged: HeartDisease\_Expert at Step 3
- Alignment Feedback: The HeartDisease\_Expert provided an incorrect classification of chest pain types. The correct classification for the UCI Heart Disease Dataset is: 0 = Typical Angina (TA), 1 = Atypical Angina (ATA), 2 = Non-Anginal Pain (NAP), 3 = Asymptomatic (ASY).
- After Alignment: \["Non-Anginal Pain (NAP)", "Asymptomatic (ASY)", "Atypical Angina (ATA)"\] ✓

AEA: Fixed Error (Chemistry)

- Transition: incorrect → correct
- Question: Assume that all gases are perfect and that data refer to 298.15 K unless otherwise stated. A sample consisting of $3.00~\mathrm{mol}$ of diatomic perfect gas molecules at $200~\mathrm{K}$ is compressed reversibly…
- Original Answer: The heat transfer $q$ for the reversible adiabatic compression of the diatomic perfect gas is: $q=200~\mathrm{J}$ ✗
- Agent Flagged: Calorimetry\_Expert at Step 2
- Alignment Feedback: The Calorimetry\_Expert incorrectly stated that ‘In an adiabatic process, there is no heat exchange between the system and the surroundings, which means that the heat transfer, q, is zero.’ This is factually correct, but the agent’s reasoning for this…
- After Alignment: The heat transfer $q$ for the adiabatic compression process is: $q=0~\mathrm{J}$ ✓

AEA: Fixed Error (HumanEval)

- Transition: incorrect → correct
- Question: `def multiply(a, b):` Complete the function that takes two integers and returns the product of their unit digits. Examples: `multiply(148, 412)` should return 16.
- Original Answer: `def multiply(a, b): unit_digit_a = abs(a) % 10; ...` ✗
- Agent Flagged: Software\_Expert at Step 3
- Alignment Feedback: The agent correctly implemented the solution using the modulus operator and absolute values, but the provided answer by the user does not match the expected output for negative numbers. The agent’s implementation correctly handles negative numbers.
- After Alignment: Correct implementation using absolute value and modulus. ✓

AEA: Fixed Error \(HumanEval\)

- Transition: incorrect → correct
- Question: def generate\_integers\(a, b\): Given two positive integers a and b, return the even digits between a and b, in ascending order\. For example: generate\_integers\(2, 8\) =\> \[2, 4, 6, 8\]
- Original Answer: def generate\_integers\(a, b\): \.\.\. ✗
- Agent Flagged: Python\_Expert at Step 4
- Alignment Feedback: The agent provided an implementation that uses a generator expression, but the initial problem statement and expected answer are not aligned with the solution\. The expected answer is a list comprehension with explicit variable names\.
- After Alignment: Correct implementation returning even digits in ascending order\. ✓
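One plausible reading of the generate\_integers task can be sketched as below. The sketch assumes that "even digits" means even single-digit values (2, 4, 6, 8) and that the two bounds may arrive in either order; neither assumption is confirmed by the truncated case, and the agent's own implementation is elided.

```python
def generate_integers(a: int, b: int) -> list[int]:
    """Return the even single-digit values between a and b, ascending."""
    # Assumption: only the even digits 2, 4, 6, 8 qualify, so the range is
    # clamped to [2, 8]; min/max make the argument order irrelevant.
    lower = max(2, min(a, b))
    upper = min(8, max(a, b))
    return [n for n in range(lower, upper + 1) if n % 2 == 0]


print(generate_integers(2, 8))
print(generate_integers(8, 2))
print(generate_integers(10, 14))
```

Under this reading, inputs entirely above 8 yield an empty list because the clamped range is empty.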

AEA: Stayed Wrong \(AIME24\)

- Transition: incorrect → incorrect
- Question: Among the 900 residents of Aimeville, there are 195 who own a diamond ring, 367 who own a set of golf clubs, and 562 who own a garden spade\. In addition, each of the 900 residents owns a bag of candy hearts\. There are 437 residents who own exactly two…
- Original Answer: The number of residents of Aimeville who own all four items \(a diamond ring, a set of golf clubs, a garden spade, and a bag of candy hearts\) is 234\. ✗
- Agent Flagged: Statistics\_Expert at Step 13
- Alignment Feedback: The Statistics\_Expert incorrectly stated that \|A∩B∩C\| = 234 \(residents owning exactly three items\) is the same as \|A∩B∩C∩D\| \(residents owning all four items\)\. This is invalid reasoning because the 234 residents who own exactly three items…
- After Alignment: The number of residents of Aimeville who own all four items is 234\. ✗

AEA: Regression \(AIME24\)

- Transition: correct → incorrect
- Question: Find the number of ways to place a digit in each cell of a 2×3 grid so that the sum of the two numbers formed by reading left to right is 999, and the sum of the three numbers formed by reading top to bottom is 99…
- Original Answer: 45 ✓
- Agent Flagged: ConstraintOptimization\_Expert at Step 7
- Alignment Feedback: The ConstraintOptimization\_Expert initially failed to correctly interpret the problem and the example provided, leading to an incorrect constraint analysis\. However, after recognizing the error and revising their approach, they correctly identified the constraints\.
- After Alignment: The number of ways to place a digit in each cell of a 2×3 grid so that the sum of the two numbers formed by reading left to right is 999, and the sum… ✗
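The original answer of 45 in the grid case can be checked independently by brute force over all digit assignments. The sketch below is not part of the paper's pipeline. It assumes leading zeros are allowed, so a row such as 0, 0, 8 is read as the number 8; the function name `count_grids` is hypothetical.

```python
from itertools import product


def count_grids() -> int:
    """Count 2x3 digit grids [[a, b, c], [d, e, f]] meeting both sum constraints."""
    count = 0
    for a, b, c, d, e, f in product(range(10), repeat=6):
        # Two 3-digit numbers read left to right (one per row).
        row_sum = (100 * a + 10 * b + c) + (100 * d + 10 * e + f)
        # Three 2-digit numbers read top to bottom (one per column).
        col_sum = (10 * a + d) + (10 * b + e) + (10 * c + f)
        if row_sum == 999 and col_sum == 99:
            count += 1
    return count


print(count_grids())  # 45
```

The count agrees with a short hand argument: the row constraint forces each column to sum to 9, and the column constraint then forces a + b + c = 8, which has C\(10, 2\) = 45 solutions in digits.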

### I\.2 Self\-Reflection Cases

Self\-Reflection: Fixed Error \(AIME24\)

- Transition: incorrect → correct
- Question: Find the number of ways to place a digit in each cell of a 2×3 grid so that the sum of the two numbers formed by reading left to right is 999, and the sum of the three numbers formed by reading top to bottom is 99…
- Original Answer: The experts concluded that there are no valid combinations of digits that satisfy both constraints of the problem\. ✗
- Agent Flagged: Algebra\_Expert at Step 2
- Alignment Feedback: The Algebra\_Expert failed to recognize the inconsistency between the provided example and the constraints early in the conversation, causing the agents to pursue a solution for a potentially unsolvable problem\.
- After Alignment: The number of valid ways to fill the 2×3 grid so that the sum of the two numbers formed by reading left to right is 999, and the sum of the three numbers… ✓

Self\-Reflection: Fixed Error \(AIME25\)

- Transition: incorrect → correct
- Question: On △ABC points A, D, E, and B lie in that order on side AB with AD = 4, DE = 16, and EB = 8\. Points A, F, G, and C lie in that order on side AC with AF = 13, FG = 52, and GC = 26\. Let M…
- Original Answer: 588 ✗
- Agent Flagged: Verification\_Specialist at Step 10
- Alignment Feedback: The Verification\_Specialist correctly identified a discrepancy between the triangle decomposition and shoelace formula results, but misinterpreted it as a self\-intersection problem rather than recognizing the correct area calculation method\.
- After Alignment: The area of heptagon AFNBCEM is 1536\. ✓

Self\-Reflection: Fixed Error \(DataBench\)

- Transition: correct → correct
- Question: DATA ANALYSIS TASK: You are working with the FIFA Football Dataset\. Your task is to answer the following question through data analysis\. Question: What is the most common preferred foot amongst players? Expected answer type: category…
- Original Answer: Right ✓
- Agent Flagged: None at Step 0
- Alignment Feedback: There were no critical mistakes in this collaboration\. All agents worked effectively together, providing complementary expertise from different domains\. The FIFA\_Data\_Analyst\_Expert led the discussion appropriately\.
- After Alignment: Right ✓

Self\-Reflection: Fixed Error \(Physics\)

- Transition: incorrect → correct
- Question: Assume all gases are perfect unless stated otherwise\. Unless otherwise stated, thermodynamic data are for 298\.15 K\. Calculate the standard enthalpy of solution of AgCl\(s\) in water from the enthalpies of formation of the solid…
- Original Answer: The standard enthalpy of solution of AgCl\(s\) in water is 65\.49 kJ/mol\. ✗
- Agent Flagged: None at Step 0
- Alignment Feedback: There were no critical mistakes in this conversation\. All agents collaborated effectively to solve the problem\. The SolutionChemistry\_Expert provided the initial approach, ThermodynamicCalculations\_Expert verified the calculation\.
- After Alignment: The standard enthalpy of solution \(ΔH°\_sol\) of AgCl\(s\) in water is 64\.71 kJ/mol\. This was calculated using Hess’s Law\. ✓

Self\-Reflection: Fixed Error \(DataBench\)

- Transition: incorrect → correct
- Question: DATA ANALYSIS TASK: You are working with the Taxi Trip Dataset\. Your task is to answer the following question through data analysis\. Question: Which payment type is the most common in the dataset? Expected answer type: category…
- Original Answer: Credit card ✗
- Agent Flagged: TransportData\_Expert at Step 2
- Alignment Feedback: The TransportData\_Expert incorrectly set the foundation for analysis by claiming credit card \(1\) was the most common payment type without sufficient evidence\. Their simulated analysis showing credit card at ∼60% and cash at ∼40%…
- After Alignment: credit/debit card ✓

Self\-Reflection: Stayed Wrong \(AIME24\)

- Transition: incorrect → incorrect
- Question: Let A, B, C, and D be points on the hyperbola x²/20 − y²/24 = 1 such that ABCD is a rhombus whose diagonals intersect at the origin\. Find the greatest real number that is less than BD² for all such rhombi…
- Original Answer: 80 ✗
- Agent Flagged: none at Step 0
- Alignment Feedback: The collaboration was excellent, with all agents contributing effectively to solving the problem\. The GeometricConstraints\_Expert correctly set up the parametrization of points on the hyperbola, derived the constraint t\_A = −t\_B for forming a rhombus\.
- After Alignment: The greatest real number that is less than BD² for all such rhombi is 191\.9999… or simply 192\. ✗

Self\-Reflection: Regression \(AIME24\)

- Transition: correct → incorrect
- Question: Consider the paths of length 16 that follow the lines from the lower left corner to the upper right corner on an 8×8 grid\. Find the number of such paths that change direction exactly four times, as in the examples shown below…
- Original Answer: 294 ✓
- Agent Flagged: None at Step 0
- Alignment Feedback: The collaboration was excellent, with agents working together effectively to solve the problem\. The CombinatorialMath\_Expert and PathCounting\_Expert both provided thorough mathematical analyses that complemented each other, verified their work through multiple approaches\.
- After Alignment: The number of paths of length 16 that follow the lines from the lower left corner to the upper right corner on an 8×8 grid and change direction exactly four times… ✗
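The original answer of 294 in the path-counting case can be verified by enumerating every lattice path. The sketch below is an independent check, not the agents' method. It assumes a path is a sequence of 8 right moves and 8 up moves, and that a direction change is any adjacent pair of differing moves; `count_paths` and its parameters are hypothetical names.

```python
from itertools import combinations


def count_paths(n: int = 8, turns: int = 4) -> int:
    """Count monotone paths on an n x n grid with exactly `turns` direction changes."""
    total = 0
    # Choose which of the 2n steps are right moves; the rest are up moves.
    for right_positions in combinations(range(2 * n), n):
        rights = set(right_positions)
        moves = ["R" if i in rights else "U" for i in range(2 * n)]
        # A direction change is any adjacent pair of differing moves.
        changes = sum(1 for x, y in zip(moves, moves[1:]) if x != y)
        if changes == turns:
            total += 1
    return total


print(count_paths())  # 294
```

The same count follows by hand: four changes give five runs, so one direction splits its 8 steps into 3 runs and the other into 2, giving 2 · C\(7, 2\) · C\(7, 1\) = 294.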