# Conformal Agent Error Attribution
Source: [https://arxiv.org/html/2605.06788](https://arxiv.org/html/2605.06788)
Naihe Feng (Dalhousie University, Halifax, NS, Canada; nfeng@dal.ca), Yi Sui∗ (Layer 6 AI, Toronto, ON, Canada; amy@layer6.ai), Shiyi Hou (Layer 6 AI, Toronto, ON, Canada; gloria@layer6.ai), Ga Wu (Dalhousie University, Halifax, NS, Canada; ga.wu@dal.ca), Jesse C. Cresswell (Layer 6 AI, Toronto, ON, Canada; jesse@layer6.ai)
###### Abstract
When multi-agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model-based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP) which provides finite-sample, distribution-free coverage guarantees. We introduce new algorithms for filtration-based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to roll back MAS to correct their own errors. Our overall approach is model-agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at [github.com/layer6ai-labs/conformal-agent-error-attribution](https://github.com/layer6ai-labs/conformal-agent-error-attribution).
## 1 Introduction
Figure 1: Conformal agent error attribution isolates the decisive error in a failed MAS trajectory within a conformal prediction set, providing statistical guarantees of coverage.

Advances in large language models (LLMs) have driven the widespread adoption of multi-agent systems (MAS) for complex tasks requiring decomposition, coordination, and tool use [[10](https://arxiv.org/html/2605.06788#bib.bib38)], with strong empirical performance in domains such as software engineering [[14](https://arxiv.org/html/2605.06788#bib.bib40), [15](https://arxiv.org/html/2605.06788#bib.bib41)], scientific discovery [[8](https://arxiv.org/html/2605.06788#bib.bib43)], and financial decision making [[30](https://arxiv.org/html/2605.06788#bib.bib42)]. However, the increased system complexity and rich interactions in MAS make them prone to errors from incorrect intermediate decisions, miscoordination among agents, and long-horizon dependencies [[4](https://arxiv.org/html/2605.06788#bib.bib26), [19](https://arxiv.org/html/2605.06788#bib.bib3)]. While detecting *overall* task failure is often straightforward, understanding why and where a failure originated remains challenging yet critical for debugging and self-correction. Identifying the decision step that constitutes the decisive error point has emerged as a central challenge for improving MAS.
Most existing MAS error attribution approaches, including naive LLM-as-a-judge methods [[32](https://arxiv.org/html/2605.06788#bib.bib1)], structured reasoning pipelines [[13](https://arxiv.org/html/2605.06788#bib.bib8), [33](https://arxiv.org/html/2605.06788#bib.bib33), [31](https://arxiv.org/html/2605.06788#bib.bib34)], and fine-tuned attribution models [[17](https://arxiv.org/html/2605.06788#bib.bib27)], ultimately produce a point prediction, committing to a single responsible step. In practice, point predictions provide limited actionable insight for practitioners as they offer no principled form of uncertainty quantification to assess reliability, undermining the trustworthiness of error attribution systems [[25](https://arxiv.org/html/2605.06788#bib.bib2)]. Conformal prediction (CP) offers a promising direction for addressing this limitation through the generation of *prediction sets*. CP enables reliable decision-making under uncertainty by providing statistical guarantees across a range of applications [[2](https://arxiv.org/html/2605.06788#bib.bib25), [7](https://arxiv.org/html/2605.06788#bib.bib14)].
Motivated by these developments, we propose an uncertainty-aware framework for error attribution in MAS based on CP, which provides finite-sample, distribution-free coverage guarantees. Rather than predicting a single step, our methods identify a localized region of the execution trace that is guaranteed to contain the decisive error at a user-specified confidence level (see [Figure 1](https://arxiv.org/html/2605.06788#S1.F1)). We introduce novel methods for filtration-based CP which are adapted to sequential data structures, like agent trajectories. Unlike existing CP approaches that produce arbitrary sets, our methods produce *contiguous* sets, aligning with the inherent ordinal structure of sequential data. Finally, we use conformal sets to roll back the MAS to before the decisive error, allowing the agent to restart and fix its own mistakes. Our approach is model-agnostic and can wrap existing black-box attribution scores, while providing a principled uncertainty layer for error attribution in real-world MAS.
## 2 Background & Related Work
### 2.1 Post-hoc Multi-agent System Error Attribution
Recent work on error attribution in MAS has primarily studied post-hoc localization of errors using execution traces. Early approaches used naive LLM-as-a-judge, where a single LLM directly predicts the responsible step from a failed trace [[32](https://arxiv.org/html/2605.06788#bib.bib1)]. Subsequent work improved attribution quality through more sophisticated LLM pipelines, including *context engineering* as well as multi-LLM frameworks. For example, ECHO [[3](https://arxiv.org/html/2605.06788#bib.bib32)] improves attribution by organizing long traces into hierarchical contexts and aggregating evaluations via consensus, while RAFFLES [[33](https://arxiv.org/html/2605.06788#bib.bib33)] employs a multi-turn, multi-LLM architecture that iteratively proposes and critiques candidate error steps. Additionally, CORRECT [[31](https://arxiv.org/html/2605.06788#bib.bib34)] incorporates retrieval to localize the error step based on similar events. A complementary line of work fine-tunes specialized LLM judges for error attribution. In particular, AEGIS [[17](https://arxiv.org/html/2605.06788#bib.bib27)] constructs large-scale labeled failure traces via controlled error injection for fine-tuning LLMs on the task.
In our experiments we compare the efficacy of these three main classes of evaluators: naive, context-engineered, and fine-tuned LLM judges. We note that all of the above research assumes a single decisive error, whereas in practical MAS applications small errors may accumulate into large ones. Due to the lack of labeled datasets with more nuanced error definitions, we follow current work and focus on the decisive error setting.
### 2.2 Conformal Prediction
For a classification problem where inputs $x \in \mathcal{X}$ and ground-truth values $y^* \in \mathcal{Y} = [\ell] := (1, \dots, \ell)$ are drawn jointly from a distribution $(x, y^*) \sim \mathbb{P}$, CP first calibrates a threshold $\hat{q}$ from a set of held-out data. Then for a new datapoint $x_{n+1}$, CP outputs a set of classes $C(x_{n+1}; \hat{q}) \subseteq \mathcal{Y}$ which contains $y^*$ with high probability, $\mathbb{P}[y_{n+1}^* \in C(x_{n+1}; \hat{q})] \geq 1 - \alpha$. This *coverage* guarantee is distribution-free and valid in finite samples, while also allowing the user to set their own error tolerance $\alpha$ [[27](https://arxiv.org/html/2605.06788#bib.bib17), [26](https://arxiv.org/html/2605.06788#bib.bib20)].

To perform CP, one first defines a *conformal score* function $S: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$, which should take smaller values when $y = y^*$ is the correct label for $x$. In practice, $S(x, y)$ often leverages the predictions of a pre-trained classification model $f: \mathcal{X} \to \mathcal{Y}$. Using a set of $n$ calibration datapoints, CP computes the scores $\{S_i\}_{i=1}^{n} = \{S(x_i, y_i^*)\}_{i=1}^{n}$, and finds the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile which is set as the threshold $\hat{q}$. Prediction sets can then be generated by including all classes for which the score is at most $\hat{q}$: $C(x_{n+1}; \hat{q}) = \{y \in \mathcal{Y} \mid S(x_{n+1}, y) \leq \hat{q}\}$. When $x_{n+1}$ is exchangeable with the calibration data, the coverage guarantee is valid. Exchangeability is a mild assumption that automatically holds when data is i.i.d., and hence is reasonable for many machine learning contexts, including the agent error attribution task as we show in our experiments. At equal coverage levels $1-\alpha$, smaller prediction sets are considered more useful both for uncertainty quantification over the predictions of $f$ [[24](https://arxiv.org/html/2605.06788#bib.bib19), [1](https://arxiv.org/html/2605.06788#bib.bib13), [16](https://arxiv.org/html/2605.06788#bib.bib18)], and in downstream tasks [[7](https://arxiv.org/html/2605.06788#bib.bib14), [6](https://arxiv.org/html/2605.06788#bib.bib15)].
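To make the calibration step concrete, the following is a minimal Python sketch of split-conformal calibration and set construction (our own illustration, not the released implementation); the uniform random scores are placeholders standing in for a real conformal score function $S$.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Return the ceil((n+1)(1-alpha))/n empirical quantile of the
    calibration scores {S_i} = {S(x_i, y_i*)}, used as q_hat."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic
    assert k <= n, "requires n >= 1/alpha - 1 calibration points"
    return np.sort(cal_scores)[k - 1]

def prediction_set(test_scores, q_hat):
    """Include every class y whose score S(x_{n+1}, y) is at most q_hat."""
    return [y for y, s in enumerate(test_scores) if s <= q_hat]

# Toy usage: placeholder scores drawn at random (illustrative only).
rng = np.random.default_rng(0)
q_hat = calibrate_threshold(rng.uniform(size=100), alpha=0.2)
print(prediction_set(rng.uniform(size=10), q_hat))
```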
### 2.3 Conformal Prediction for Sequential Data
Another common setting is where data is sequential, $x = (c_1, \dots, c_\ell)$, with variable length $\ell$, where the ground truth $y^* \subset x$ is a subset of elements. Following Kuwahara et al. [[18](https://arxiv.org/html/2605.06788#bib.bib10)], the principles of CP can be used to calibrate a threshold $\hat{q}$, and predict a subset $C(x_{n+1}; \hat{q}) \subseteq x_{n+1}$ which retains the ground truth elements with high probability, $\mathbb{P}[y_{n+1}^* \subseteq C(x_{n+1}; \hat{q})] \geq 1 - \alpha$. In some settings, $y^*$ will consist of multiple elements, and the predicted set $C(x_{n+1}; \hat{q})$ need not be contiguous. For agent error attribution we take $x$ to be the agent's trajectory, and $y^*$ to be the single decisive error, one of the $c_i$. As we will discuss, for downstream applications including automated rollbacks of the agent's state it is desirable to predict sets of *consecutive* elements, rather than arbitrary subsets. Hence, we develop novel CP algorithms that satisfy a coverage guarantee using contiguous prediction sets,

$$\mathbb{P}[y_{n+1}^* \in C(x_{n+1}; \hat{q})] \geq 1 - \alpha, \quad C(x_{n+1}; \hat{q}) = (c_j, \dots, c_k). \tag{1}$$
The only existing CP algorithms that produce contiguous sets were designed for hierarchical classification [[21](https://arxiv.org/html/2605.06788#bib.bib4)]. We describe one such algorithm in [Section 3.1.2](https://arxiv.org/html/2605.06788#S3.SS1.SSS2) and adapt it for sequential data.
## 3 Conformal Agent Error Attribution
For the remainder of this work we take $x = (c_1, \dots, c_\ell)$ to be an agent trajectory which has failed to complete the desired task. Each step $c_j$ can contain any available information such as the environment's state, action taken, and observed response. One of the steps $y^* \in x$ is labeled as the decisive error, the earliest error that the MAS cannot recover from. The aim is to produce a prediction set $C(x_{n+1}) \subseteq x_{n+1}$ that gives valid coverage, where smaller sets are preferred.

Applying CP to agent error attribution requires two components: an algorithm which takes a calibration dataset and generates a prediction set for $x_{n+1}$ with valid coverage; and a scoring function $g(C(x))$ acting on sets of steps, which quantifies the likelihood that $y^* \in C(x)$. We design these two components separately so that they are interchangeable, and discuss the pros and cons of each option.
### 3.1 Conformal Algorithms for Agent Error Attribution
#### 3.1.1 Vanilla Conformal Prediction
The simplest approach is to ignore the sequential nature of $x$ and treat all steps as unordered classes in an $\ell$-way classification task. We write $S_{\text{VCP}} = 1 - g$ for the conformal score function, and $C_{\text{VCP}}(x_{n+1}; \hat{q}) = \{c_i \in x_{n+1} \mid S_{\text{VCP}}(x_{n+1}, c_i) \leq \hat{q}\}$ for prediction sets generated by Vanilla CP (VCP). Prediction requires iterating over every step in the trajectory using $\ell$ evaluations of $g$, and does not produce contiguous sets.
#### 3.1.2 Leaf-to-Root Tree Traversal
Figure 2: An example binary tree $\mathcal{T}$ representing an agent trajectory $x$ consisting of four steps $c_1, \dots, c_4$: the root $v_1 = (c_1, c_2, c_3, c_4)$ splits into $v_2 = (c_1, c_2)$ and $v_3 = (c_3, c_4)$, which split into the leaves $v_4 = (c_1)$, $v_5 = (c_2)$, $v_6 = (c_3)$, $v_7 = (c_4)$. Contiguous prediction sets $C(x_{n+1})$ will consist of a single node $v_i$.

To produce contiguous sets, we can adapt algorithms for hierarchical classification by mapping agent trajectories $x$ onto a binary tree $\mathcal{T}$ as depicted in [Figure 2](https://arxiv.org/html/2605.06788#S3.F2), with root node $v_1 = [\ell]$, and leaf nodes $v_\ell, \dots, v_{2\ell-1}$ as individual steps $c_1, \dots, c_\ell$. CP is conducted by traversing the tree from leaf to root following the CRSVP algorithm [[21](https://arxiv.org/html/2605.06788#bib.bib4)], described in full detail in [Appendix B](https://arxiv.org/html/2605.06788#A2). For each test datapoint, CRSVP outputs one node of the tree as the prediction set, which is always a contiguous set, and guarantees the lower bound on coverage in [Equation 1](https://arxiv.org/html/2605.06788#S2.E1).
CRSVP lacks an upper bound on coverage, uses $\ell$ evaluations of $g$ for prediction, and produces inflexible sets following the tree's splits. For example, in [Figure 2](https://arxiv.org/html/2605.06788#S3.F2) the middle steps $c_2$ and $c_3$ can only be predicted together in the trivial case where all steps are predicted ($v_1$). VCP and CRSVP serve as baselines in our experiments. The following novel algorithms improve on their limitations.
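As a rough illustration of the tree construction (the CRSVP traversal itself is deferred to Appendix B), a trajectory of $\ell$ steps can be mapped onto a binary tree of contiguous intervals; this sketch and its function name are ours, not from the released code.

```python
def build_interval_tree(j, k):
    """Recursively halve the step interval [j, k] (1-indexed, inclusive),
    mirroring Figure 2: every node is a contiguous candidate prediction
    set, and the leaves are the individual steps c_j."""
    node = {"interval": (j, k), "children": []}
    if j < k:
        mid = (j + k) // 2
        node["children"] = [build_interval_tree(j, mid),
                            build_interval_tree(mid + 1, k)]
    return node

# The four-step trajectory of Figure 2: root (1, 4) with children (1, 2)
# and (3, 4), which split into the four singleton leaves.
tree = build_interval_tree(1, 4)
```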
#### 3.1.3 Left (Right) Filtration
Figure 3: Left filtering with $F_{\text{LF}}(x; q)$ progressively removes steps from the left as $q$ decreases. The smallest $q$ which retains the decisive error $y^*$ is used as the conformal score $S_{\text{LF}}(x, y^*)$.

Viewing a trajectory $x = (c_1, \dots, c_\ell)$ as a sequence, Left Filtration (LF) progressively removes steps from $x$ starting on the left with $c_1$ until the remaining subsequence scores below a calibrated threshold, returning a *suffix* of the full sequence.
We assume access to a scoring function $g_{\text{LF}}$ which scores subintervals $c_{j:k} := (c_j, \dots, c_k) \subseteq x$ for their likelihood to contain $y^*$, imposing the boundary conditions $g_{\text{LF}}(\varnothing) = 0$, since the empty interval cannot contain $y^*$, and $g_{\text{LF}}(x) = \infty$ since $y^* \in x$. We will use $j^*$ as the index where $y^*$ appears, so $c_{j^*} = y^*$. Finally, it is desirable, but not mandatory, to have $g_{\text{LF}}$ obey a monotonicity condition:

$$c_{b:c} \subseteq c_{a:d} \implies g_{\text{LF}}(c_{b:c}) \leq g_{\text{LF}}(c_{a:d}). \tag{2}$$

Next, we define a filtering function which returns the longest suffix that has a low enough score $g_{\text{LF}}$. Formally, we write the set of suffixes as $\mathcal{I}_{\text{LF}} := \{c_{j:\ell}\}_{j=1}^{\ell+1}$, with the conventions for subintervals that $c_{j:j} = c_j$, and $c_{j:k} = \varnothing$ when $j > k$. With these conventions, the left-filtering function is

$$F_{\text{LF}}(x; q) := \operatorname*{arg\,max}_{c_{j:\ell} \in \mathcal{I}_{\text{LF}}} \big( |c_{j:\ell}| \mid g_{\text{LF}}(c_{j:\ell}) \leq q \big), \tag{3}$$

where $q \in \mathbb{R}^+$, and we note that $F_{\text{LF}}(x; \infty) = x$. Since the suffixes in $\mathcal{I}_{\text{LF}}$ are nested, $F_{\text{LF}}(x; q)$ also satisfies a nesting property demonstrating that more steps are filtered out as $q$ decreases. (All formal proofs are given in [Section A.2](https://arxiv.org/html/2605.06788#A1.SS2).)

**Lemma (LF Nesting).** For any $x$ and thresholds $0 \leq q_1 \leq q_2$, $F_{\text{LF}}(x; q_1) \subseteq F_{\text{LF}}(x; q_2)$.
###### Proof sketch.
Suffixes themselves are nested. When $q_1$ increases to $q_2$, the set of valid suffixes with $g_{\text{LF}}(c_{j:\ell}) \leq q$ can only grow, and $F_{\text{LF}}$ returns the largest valid suffix. ∎
For a trajectory $x$, the conformal score is effectively the smallest threshold $q$ where $y^*$ is not filtered out (see [Figure 3](https://arxiv.org/html/2605.06788#S3.F3)). More formally, we define

$$S_{\text{LF}}(x, y^*) := \inf \big( q \in \mathbb{R}^+ \mid y^* \in F_{\text{LF}}(x; q) \big). \tag{4}$$

Although this definition involves two optimizations, $S_{\text{LF}}$ is designed so that its computation greatly simplifies in practice. When $g_{\text{LF}}$ obeys monotonicity ([Equation 2](https://arxiv.org/html/2605.06788#S3.E2)), $S_{\text{LF}}$ is simply the value of $g_{\text{LF}}$ on the suffix that starts at $y^*$.

**Proposition (LF Computation).** When $g_{\text{LF}}$ obeys monotonicity such that $c_{b:c} \subseteq c_{a:d} \implies g_{\text{LF}}(c_{b:c}) \leq g_{\text{LF}}(c_{a:d})$, then $S_{\text{LF}}(x, y^*) = g_{\text{LF}}(c_{j^*:\ell})$.
###### Proof sketch.
$S_{\text{LF}}(x, y^*)$ optimizes over $q$ where $F_{\text{LF}}(x; q)$ returns a suffix at least as large as $c_{j^*:\ell}$. Due to monotonicity, suffixes larger than $c_{j^*:\ell}$ cannot have scores less than $g_{\text{LF}}(c_{j^*:\ell})$, so $g_{\text{LF}}(c_{j^*:\ell})$ is the smallest valid threshold. ∎
The LF conformal algorithm computes $S_{\text{LF}}$ for each calibration datapoint, and sets the conformal threshold $\hat{q}$ as the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile. For a test datapoint we predict $C_{\text{LF}}(x_{n+1}; \hat{q}) = F_{\text{LF}}(x_{n+1}; \hat{q})$, which is the longest suffix $c_{j:\ell}$ such that $g_{\text{LF}}(c_{j:\ell}) \leq \hat{q}$. A shorter suffix is preferable since it better isolates the decisive error. This algorithm satisfies [Equation 1](https://arxiv.org/html/2605.06788#S2.E1) for any scoring function $g_{\text{LF}}$ as defined above. Before proving this coverage guarantee, we present a lemma to assist:

**Lemma (LF Equivalence).** For a fixed $\hat{q}$, we have $S_{\text{LF}}(x, y^*) \leq \hat{q} \iff y^* \in F_{\text{LF}}(x; \hat{q})$.
###### Proof sketch.
The infimum in $S_{\text{LF}}(x, y^*)$ is always achieved at $q_{\min} = \min \{ g_{\text{LF}}(c_{j:\ell}) \mid j \leq j^* \}$, and $y^* \in F_{\text{LF}}(x; q_{\min})$. Because $F_{\text{LF}}$ is nested in $q$ (Lemma (LF Nesting)), increasing $q_{\min}$ to $\hat{q}$ maintains $y^* \in F_{\text{LF}}(x; \hat{q})$. ∎
With these facts established, we can prove the coverage guarantee of [Equation 1](https://arxiv.org/html/2605.06788#S2.E1) using a standard conformal argument.

**Theorem (LF Coverage).** Suppose $\{(x_i, y_i^*)\}_{i=1}^{n}$ and $(x_{n+1}, y_{n+1}^*)$ are exchangeable. Given a scoring function $g_{\text{LF}}$ acting on subintervals of the $x_i$, define the conformal score $S_{\text{LF}}(x_i, y_i^*)$ as in [Equation 4](https://arxiv.org/html/2605.06788#S3.E4). Let $\hat{q}$ be the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile of conformal scores $\{S_i\}_{i=1}^{n} = \{S_{\text{LF}}(x_i, y_i^*)\}_{i=1}^{n}$. Then prediction sets constructed as $C_{\text{LF}}(x_{n+1}; \hat{q}) = F_{\text{LF}}(x_{n+1}; \hat{q})$ satisfy $1 - \alpha \leq \mathbb{P}[y_{n+1}^* \in C_{\text{LF}}(x_{n+1}; \hat{q})] < 1 - \alpha + \frac{1}{n+1}$.
###### Proof.
Since the fixed function $S_{\text{LF}}$ is applied to each element of the exchangeable sequence $(x_1, y_1^*), \dots, (x_{n+1}, y_{n+1}^*)$, the resulting sequence $S_1, \dots, S_{n+1}$ is also exchangeable. We assume for simplicity that the scores are non-degenerate, because noise can easily be added to break ties. Let $S_{(1)} \leq S_{(2)} \leq \dots \leq S_{(n)}$ be the order statistics of the calibration scores so that the empirical quantile $\hat{q}$ can be defined as $S_{(k)}$, where $k = \lceil (n+1)(1-\alpha) \rceil$ (assuming $n \geq \tfrac{1}{\alpha} - 1$ to ensure $k \leq n$). The event $S_{n+1} \leq \hat{q}$ occurs if and only if $S_{n+1}$ is among the $k$ smallest scores overall. From exchangeability, all orderings are equally likely, so $\mathbb{P}[S_{n+1} \leq \hat{q}] = \frac{k}{n+1}$. By Lemma (LF Equivalence) this event is equivalent to $y^* \in F_{\text{LF}}(x_{n+1}; \hat{q})$. Hence, $\mathbb{P}[y^* \in F_{\text{LF}}(x_{n+1}; \hat{q})] = \frac{\lceil (n+1)(1-\alpha) \rceil}{n+1} \geq 1 - \alpha$. For the upper bound, we simply note that $\frac{\lceil (n+1)(1-\alpha) \rceil}{n+1} < \frac{1 + (n+1)(1-\alpha)}{n+1} = 1 - \alpha + \frac{1}{n+1}$. ∎
Beyond generating contiguous prediction sets, LF uses fewer scoring function evaluations on average for inference when $g_{\text{LF}}$ is monotonic. LF starts with the first suffix $c_{\ell:\ell}$ and evaluates each longer suffix $c_{j:\ell}$ until one with $g_{\text{LF}}(c_{j:\ell}) > q$ is found, then returns $c_{j+1:\ell}$. When errors are evenly distributed over the length $\ell$, the average number of evaluations with a strong $g_{\text{LF}}$ is only $\tfrac{\ell+1}{2}$.
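A minimal sketch of the LF score and the early-stopping prediction loop just described, assuming a monotone set-level scorer `g` that maps a list of steps to a non-negative number (the helper names are ours):

```python
def lf_score(g, x, j_star):
    """Conformal score S_LF(x, y*): with a monotone g this reduces to the
    score of the suffix starting at the decisive error index j_star
    (0-indexed here), per the LF Computation proposition."""
    return g(x[j_star:])

def lf_predict(g, x, q_hat):
    """Predict the longest suffix c_{j:l} with g(c_{j:l}) <= q_hat.
    Grow the suffix leftward from the shortest one and stop at the first
    violation; with monotone g no longer suffix can pass, matching the
    ~(l+1)/2 average-NFE behaviour for a strong scorer."""
    best = x[len(x):]  # empty suffix; g(empty) = 0 by convention
    for j in range(len(x) - 1, -1, -1):
        if g(x[j:]) > q_hat:
            break
        best = x[j:]
    return best
```

Calibration then applies the same quantile routine as the vanilla sketch above to the `lf_score` values of the calibration trajectories.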
Of course, there is nothing special about filtering from the left; Right Filtration (RF) can easily be defined which filters from the right, returning a *prefix* of $x$. Using a similarly defined scoring function $g_{\text{RF}}$ we write the relevant prefixes as $\mathcal{I}_{\text{RF}} := \{c_{1:k}\}_{k=0}^{\ell}$ and define the right-filtering function $F_{\text{RF}}(x; q) := \operatorname*{arg\,max}_{c_{1:k} \in \mathcal{I}_{\text{RF}}} \big( |c_{1:k}| \mid g_{\text{RF}}(c_{1:k}) \leq q \big)$. Based on these definitions, the conformal score is $S_{\text{RF}}(x, y^*) := \inf \big( q \in \mathbb{R}^+ \mid y^* \in F_{\text{RF}}(x; q) \big)$, and prediction sets are generated as $C_{\text{RF}}(x_{n+1}; \hat{q}) = F_{\text{RF}}(x_{n+1}; \hat{q})$. RF gives coverage as a straightforward extension of Theorem (LF Coverage), and is advantageous over LF if agents tend to fail earlier in their trajectory rather than later.
#### 3.1.4 Two-Way Filtration
Figure 4: Two-way filtering with $F_{\text{TWF}}(x; q)$ uses the intersection of filtering from the right and left. The smallest $q$ which retains the decisive error $y^*$ is the conformal score $S_{\text{TWF}}(x, y^*)$.

Filtering from one direction has downsides; when the decisive error is at the start (end), the suffix (prefix) which covers the ground truth will contain the entire trajectory. Moreover, when the decisive error is near the middle, neither LF nor RF can isolate it. However, these limitations can be avoided by building on LF and RF with Two-Way Filtration (TWF). Bidirectional filtering allows more precise isolation of the decisive error in principle, regardless of where in the trajectory it occurs.
There are many possible ways to define a bidirectional filtration. We present a version that uses the *intersection* of left and right filtered subintervals. Using $g_{\text{LF}}$ and $g_{\text{RF}}$ as above, we define the two-way filtering function as

$$F_{\text{TWF}}(x; q) := F_{\text{LF}}(x; q) \cap F_{\text{RF}}(x; q). \tag{5}$$

This function finds the longest suffix and prefix that each score at most $q$, and takes their intersection. Note that this function still satisfies $F_{\text{TWF}}(x; \infty) = x$, as well as nesting. (All formal proofs are given in [Section A.3](https://arxiv.org/html/2605.06788#A1.SS3).)

**Lemma (TWF Nesting).** For any $x$ and thresholds $0 \leq q_1 \leq q_2$, $F_{\text{TWF}}(x; q_1) \subseteq F_{\text{TWF}}(x; q_2)$.
###### Proof sketch.
The result follows from nesting for L/RF, and set inclusion under intersection: for any sets $A, B, C, D$, if $A \subseteq B$ and $C \subseteq D$, then $A \cap C \subseteq B \cap D$. ∎
On top of this we can define the conformal score function

$$S_{\text{TWF}}(x, y^*) := \inf \big( q \in \mathbb{R}^+ \mid y^* \in F_{\text{TWF}}(x; q) \big), \tag{6}$$

which operates the same way as before, effectively finding the smallest threshold $q$ under which two-way filtering retains the decisive error. Although $S_{\text{TWF}}$ appears to involve three optimizations, its computation can be simplified to only two scoring function evaluations:

**Proposition (TWF Computation).** $S_{\text{TWF}}(x, y^*)$ as defined in [Equation 6](https://arxiv.org/html/2605.06788#S3.E6) can be expressed as

$$S_{\text{TWF}}(x, y^*) = \max \big( S_{\text{LF}}(x, y^*), S_{\text{RF}}(x, y^*) \big). \tag{7}$$

When $g_{\text{LF}}$ and $g_{\text{RF}}$ obey the monotonicity condition in [Equation 2](https://arxiv.org/html/2605.06788#S3.E2), $S_{\text{TWF}}$ further simplifies to

$$S_{\text{TWF}}(x, y^*) = \max \big( g_{\text{LF}}(c_{j^*:\ell}), g_{\text{RF}}(c_{1:j^*}) \big). \tag{8}$$
###### Proof sketch.
$q$ is a valid threshold ($y^* \in F_{\text{TWF}}(x; q)$) only when $y^*$ is included by both $F_{\text{LF}}$ and $F_{\text{RF}}$, which means $q$ is sufficiently high for both $S_{\text{LF}}$ and $S_{\text{RF}}$. We must use the higher of these two thresholds, giving [Equation 7](https://arxiv.org/html/2605.06788#S3.E7). [Equation 8](https://arxiv.org/html/2605.06788#S3.E8) follows from substituting in [Equation 4](https://arxiv.org/html/2605.06788#S3.E4). ∎
Like for LF, the TWF conformal algorithm computes $S_{\text{TWF}}$ for each calibration datapoint, sets the conformal threshold $\hat{q}$ as the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile, and predicts $C_{\text{TWF}}(x_{n+1}; \hat{q}) = F_{\text{TWF}}(x_{n+1}; \hat{q})$ for any test datapoint $x_{n+1}$. $F_{\text{TWF}}(x_{n+1}; \hat{q})$ will tend to be a short subinterval when $g_{\text{LF}}$ and $g_{\text{RF}}$ both narrow in on the same steps. This algorithm satisfies [Equation 1](https://arxiv.org/html/2605.06788#S2.E1) with a theorem and proof similar to Theorem (LF Coverage) (see [Section A.3](https://arxiv.org/html/2605.06788#A1.SS3)). The core prerequisite is the analogue of Lemma (LF Equivalence):

**Lemma (TWF Equivalence).** For a fixed $\hat{q}$, we have $S_{\text{TWF}}(x, y^*) \leq \hat{q} \iff y^* \in F_{\text{TWF}}(x; \hat{q})$.
###### Proof sketch.
From the use of intersection in $F_{\text{TWF}}(x; \hat{q})$ ([Equation 5](https://arxiv.org/html/2605.06788#S3.E5)), and Lemma (LF Equivalence), $y^* \in F_{\text{TWF}}(x; \hat{q}) \iff S_{\text{LF}}(x, y^*) \leq \hat{q} \wedge S_{\text{RF}}(x, y^*) \leq \hat{q}$. Equivalently, we write $\max \big( S_{\text{LF}}(x, y^*), S_{\text{RF}}(x, y^*) \big) \leq \hat{q}$, which is the same as $S_{\text{TWF}}(x, y^*) \leq \hat{q}$ ([Equation 7](https://arxiv.org/html/2605.06788#S3.E7)). ∎
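Continuing the same sketch, the TWF score and prediction follow directly from Equations 5 to 8; again `g_lf` and `g_rf` are assumed to be monotone set-level scorers with $g(\varnothing) = 0$, and the helper names are our own:

```python
def twf_score(g_lf, g_rf, x, j_star):
    """S_TWF = max(S_LF, S_RF) (Equation 7); under monotonicity this is
    Equation 8: score the suffix and the prefix that meet at the error."""
    return max(g_lf(x[j_star:]), g_rf(x[:j_star + 1]))

def twf_predict(g_lf, g_rf, x, q_hat):
    """Intersect the longest valid suffix and prefix (Equation 5)."""
    l = len(x)
    # Smallest start j gives the longest suffix with g_lf <= q_hat; the
    # empty suffix (j = l) always qualifies since g(empty) = 0.
    j = next(j for j in range(l + 1) if g_lf(x[j:]) <= q_hat)
    # Largest end k gives the longest prefix with g_rf <= q_hat.
    k = next(k for k in range(l, -1, -1) if g_rf(x[:k]) <= q_hat)
    return x[j:k]  # empty when the suffix and prefix do not overlap
```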
While we described these algorithms using the language of agent trajectories, they can equally be applied to any type of sequential data $x$ with ground truth $y^* \in x$, for example electronic health records where $x$ is a sequence of vitals measurements, and $y^*$ is the first warning sign that should trigger medical intervention [[23](https://arxiv.org/html/2605.06788#bib.bib7)].
Table 1: Conformal algorithm properties.

| Method | Contiguous | Inference NFE | Cov. Upper Bound |
| --- | --- | --- | --- |
| VCP | × | $\ell$ | ✓ |
| CRSVP | ✓ | $\ell$ | × |
| L/RF | ✓ | $\approx \frac{1}{2}(\ell+1)$ | ✓ |
| TWF | ✓ | $\ell$ | ✓ |

The five conformal algorithms we compare are summarized in [Table 1](https://arxiv.org/html/2605.06788#S3.T1), showing which generate contiguous prediction sets, the number of function evaluations (NFE) of $g$ needed for each test datapoint in the case where $g$ is strong and errors are uniformly distributed, and whether an upper bound on coverage is guaranteed.
### 3.2 Scoring Functions for Agent Error Attribution
Each of the preceding conformal algorithms requires a scoring function $g$ which estimates how likely a step, or set of steps, is to contain the decisive error. Since MAS traces are primarily composed of textual agent interactions, our scoring functions rely on LLM-based components to map agent inputs and outputs to numerical scores. We compare three LLM regimes as outlined in [Section 2.1](https://arxiv.org/html/2605.06788#S2.SS1):
1. *Naive LLM-as-a-judge* is implemented by prompting gpt-4o-mini to estimate error likelihoods with information about the task and step, as in [[32](https://arxiv.org/html/2605.06788#bib.bib1)];
2. *Prompt and context engineering* produces step-level likelihood estimates using multiple LLMs with role-specific prompts, inspired by ECHO [[3](https://arxiv.org/html/2605.06788#bib.bib32)]. Scores from LLMs with different roles are averaged to get a final likelihood.
3. A *fine-tuned specialized LLM* is implemented by following the data generation and training paradigm of AEGIS [[17](https://arxiv.org/html/2605.06788#bib.bib27)] to fine-tune a Qwen3-1.7B model [[29](https://arxiv.org/html/2605.06788#bib.bib39)].
Additional details on prompts, training, and data are given in [Appendix C](https://arxiv.org/html/2605.06788#A3). For VCP, step-level scores as described above are used directly. However, L/R/TWF and CRSVP require set-level scores. To enable efficient computation in L/R/TWF we design $g$ to obey monotonicity ([Equation 2](https://arxiv.org/html/2605.06788#S3.E2)) by aggregating step-level scores. Specifically, we use summation and normalize by the trajectory length $\ell$ to make conformal scores for different datapoints more comparable: $g(c_{j:k}) = \frac{1}{\ell} \sum_{i=j}^{k} \text{LLM}(c_i)$. We experiment with alternative monotonic aggregations, namely Max and LogSumExp, in [Section C.2](https://arxiv.org/html/2605.06788#A3.SS2), and discuss circumstances when they may be preferred. Below, we evaluate the discriminatory power of $g$ independently from CP algorithms by treating it as a multiclass classifier.
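For instance, a set-level scorer with this normalized-sum aggregation might look like the following sketch, where `step_scores` holds non-negative per-step LLM error likelihoods and trajectories are represented by those scores (the factory framing is our own):

```python
def make_interval_scorer(step_scores):
    """Build a set-level g from non-negative per-step LLM scores.
    A normalized sum can only grow as the interval grows, so g satisfies
    the monotonicity condition of Equation 2."""
    l = len(step_scores)
    def g(steps):
        # `steps` is a contiguous slice of the trajectory's step scores;
        # an empty slice scores 0, matching the boundary convention.
        return sum(steps) / l
    return g

g_lf = make_interval_scorer([0.1, 0.7, 0.2, 0.4])  # toy per-step scores
print(g_lf([0.7, 0.2]))  # score of the subinterval c_{2:3}
```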
### 3.3 Conformal Agent Rollbacks
Figure 5: After generating a conformal set for a failed task, we roll back the state of the MAS to the first step in the set, and restart the agent with information on the failed trace.

CP sets for agent error attribution have multiple uses. They can be used by humans for manual debugging; a person can focus only on the steps within the set and have good coverage of true errors. Here, contiguity is a great benefit: it is much easier to debug several consecutive steps than to parse through a scattered set.
However, our main downstream application is automated agent recovery. Knowing that an agent has failed, we wish for it to learn from its mistakes and retry parts of the task. Ideally, the agent's state is rolled back in the trajectory only far enough to cover the decisive error; rolling back further is inefficient. However, error attribution accuracy is low for point prediction [[32](https://arxiv.org/html/2605.06788#bib.bib1), [3](https://arxiv.org/html/2605.06788#bib.bib32), [33](https://arxiv.org/html/2605.06788#bib.bib33)].
Prediction sets with coverage guarantees provide the solution for automated rollbacks of failed trajectories. Given a conformal set, we roll back the state of the MAS to the first step in the set, and restart the trajectory, as depicted in [Figure 5](https://arxiv.org/html/2605.06788#S3.F5). The coverage guarantee ensures with high confidence that we roll back far enough to correct the decisive error, while contiguity ensures that we don't roll back excessively far. Upon restarting, we add information to the prompt about the steps taken previously, and instruct the MAS to replan its trajectory to avoid making the same mistakes.
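In pseudo-Python, the rollback loop might look like the following sketch; `mas.restore_state`, `mas.run`, and the prompt format are hypothetical placeholders, not an API from the paper or its released code:

```python
def conformal_rollback(mas, failed_trace, conformal_set, task):
    """Roll the MAS back to the first step of the conformal set and retry.
    Coverage ensures (w.p. >= 1 - alpha) we rewind past the decisive
    error; contiguity keeps the number of redone steps small."""
    restart_index = failed_trace.index(conformal_set[0])  # earliest step in set
    mas.restore_state(step=restart_index)                 # hypothetical method
    hint = (f"A previous attempt failed. Steps from index {restart_index} "
            f"onward were implicated: {failed_trace[restart_index:]}. "
            "Replan to avoid repeating these mistakes.")
    return mas.run(task, extra_context=hint)              # hypothetical method
```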
## 4 Experiments
### 4.1 Datasets
We evaluate CP algorithms, scoring functions, and downstream task performance on both a real-world benchmark and synthetic MAS traces. The Who&When dataset [[32](https://arxiv.org/html/2605.06788#bib.bib1)] (MIT license) is a benchmark for step-level error attribution in MAS. Each of the 184 real-world examples contains a full agent execution trace annotated with the decisive error step. (We remove inconsistent datapoints where the error index is greater than the trajectory length.) Who&When is small-scale and provides limited variety over agents and tasks, but it is the only existing academic dataset with step-level error annotations by humans.
Figure 6: Normalized decisive error position distributions for the real-world benchmark (Who&When), and controlled error injection datasets (Left/Mid/Right Dense) on GSM8k.

In the absence of other real-world data, and to provide greater variety, we follow existing practices [[17](https://arxiv.org/html/2605.06788#bib.bib27)] to synthetically generate failed agent trajectories through error injection. To induce errors at controlled steps, we inject instructions to create errors into the agent's prompt, using the error mode taxonomy of Cemri et al. [[4](https://arxiv.org/html/2605.06788#bib.bib26)]. For a given trajectory, one step is selected uniformly at random, and the agent is restarted in that state with a modified prompt describing the desired error mode. Given that the overall trajectory fails, the injected step is labeled as the decisive error. We use two mathematical reasoning datasets, MATH [[12](https://arxiv.org/html/2605.06788#bib.bib29)] and GSM8k [[5](https://arxiv.org/html/2605.06788#bib.bib28)] (MIT licenses), paired with two representative multi-agent architectures, MACNET [[22](https://arxiv.org/html/2605.06788#bib.bib30)] and DyLAN [[20](https://arxiv.org/html/2605.06788#bib.bib31)]. For each dataset–architecture combination, we generate 1200 failed trajectories.
Complex tasks generally have some steps that are harder to complete than others, leading to non-uniform decisive error distributions. This distribution is shown in [Figure 6](https://arxiv.org/html/2605.06788#S4.F6) for the Who&When dataset, which tends to have decisive errors early on. To mimic this property, and explore the impact of distribution on error attribution performance, we construct variants of the synthetic data by dividing the normalized trajectory length into thirds, and subsampling conditional on decisive error location. These are the Left, Mid, and Right Dense variants of GSM8k in [Figure 6](https://arxiv.org/html/2605.06788#S4.F6). Additional details and analysis are provided in [Appendix C](https://arxiv.org/html/2605.06788#A3).
### 4.2 Evaluation Metrics
Scoring functions $g$ can be viewed as classifiers on an $\ell$-way task, predicting the decisive error location. Hence, we evaluate them using AUROC, AUPRC, and accuracy. Since $\ell$ can be large, baseline levels of these metrics are very low.
Given a test dataset $\{(x_i, y_i^*)\}_{i=n+1}^{n+m}$ and predicted sets $\{C(x_i)\}_{i=n+1}^{n+m}$, we evaluate conformal agent error attribution with the following metrics:

$$\text{EC} = \frac{1}{m} \sum_{i=n+1}^{n+m} \mathbb{I}\left[ y_i^* \in C(x_i) \right]; \qquad \text{RR} = \frac{1}{m} \sum_{i=n+1}^{n+m} \left( 1 - \frac{|C(x_i)|}{\ell_i} \right). \tag{9}$$

*Empirical Coverage* (EC) measures how often the predicted set contains the decisive error. EC should be at least $1-\alpha$, the target coverage, and respect the corresponding upper bound where applicable. *Removal Rate* (RR) measures how much of the trajectory is filtered out, essentially measuring prediction set size, where $\ell_i$ denotes the number of steps in $x_i$. A higher removal rate indicates smaller conformal sets, and more precise localization of the error step, given equal EC.
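Both metrics are straightforward to compute; a small sketch with helper names of our own:

```python
def empirical_coverage(pred_sets, error_steps):
    """EC (Equation 9): fraction of test trajectories whose predicted set
    contains the decisive error y*."""
    return sum(y in C for C, y in zip(pred_sets, error_steps)) / len(pred_sets)

def removal_rate(pred_sets, traj_lengths):
    """RR (Equation 9): average fraction of steps filtered out of each
    trajectory; higher means smaller, more precise sets."""
    return sum(1 - len(C) / l
               for C, l in zip(pred_sets, traj_lengths)) / len(pred_sets)
```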
Finally, rollback performance is measured across three axes: Success Rate, the fraction of tasks successfully completed after the rollback; Coverage, the fraction of tasks where the state is rolled back at least to the decisive error; and Cost, the fraction of steps rolled back, which must be redone. As a baseline, we test the agent restarting from the single most likely predicted step (Top-1).
## 5 Results
Figure 7: Empirical Coverage vs. Target Coverage for CP methods with the fine-tuned LLM scoring function on MATH.

### 5.1 Empirical Verification of Coverage Guarantee
First, we empirically verify that the CP algorithms considered in this work satisfy their respective coverage guarantees from [Section 3](https://arxiv.org/html/2605.06788#S3). [Figure 7](https://arxiv.org/html/2605.06788#S5.F7) shows empirical coverage as a function of the target coverage $1-\alpha$, where we average EC over 1000 random, even splits of the calibration and test sets. The shaded regions show one standard deviation. EC is above the lower bound for all methods, consistent with the theoretical guarantees. VCP, RF, and TWF all obey their respective upper bounds, leading to exact linear behaviour. A deviation is observed in the low $1-\alpha$ regime for CRSVP, but this is not a violation as CRSVP does not provide an upper bound on coverage.
### 5.2 Scoring Function Evaluation
Table 2: Step-level discrimination metrics across scoring functions on the combined MATH and GSM8k test set.

| Scoring function | AUROC | AUPRC | Accuracy |
| --- | --- | --- | --- |
| Random Guessing (baseline) | 0.500 | 0.118 | 0.118 |
| Naive LLM (gpt-4o-mini) | 0.519 | 0.128 | 0.110 |
| Role Prompted (gpt-4o-mini) | 0.554 | 0.161 | 0.164 |
| Fine-tuned (Qwen3-1.7B) | 0.762 | 0.382 | 0.731 |

Next, we compare the three regimes of scoring functions for discriminatory power as binary classifiers. [Table 2](https://arxiv.org/html/2605.06788#S5.T2) shows results on a combination of GSM8k and MATH test sets, matching the training distribution for the fine-tuned LLM. Increasing the complexity of $g$ does pay off; fine-tuning on the error classification task considerably increases its power. The naive prompted LLM is hardly better than random guessing, while prompt engineering provides small improvements. These results are consistent with performance levels observed in prior studies [[32](https://arxiv.org/html/2605.06788#bib.bib1), [3](https://arxiv.org/html/2605.06788#bib.bib32), [17](https://arxiv.org/html/2605.06788#bib.bib27)].
Table 3: Removal rate (± std over 1000 calibration/test splits) for conformal algorithms and scoring functions at $1-\alpha = 0.8$. Higher removal rates indicate more precise error attribution. Left/Mid/Right Dense are the GSM8k variants. Averages are taken over conformal algorithms to evaluate scoring functions. × indicates that fine-tuning is not possible due to insufficient data.

| Scoring function | Conformal method | Who&When | Left Dense (DyLAN) | Left Dense (MACNET) | Mid Dense (DyLAN) | Mid Dense (MACNET) | Right Dense (DyLAN) | Right Dense (MACNET) | MATH (DyLAN) | MATH (MACNET) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive LLM (gpt-4o-mini) | Vanilla | 0.20 ± 0.06 | 0.21 ± 0.02 | 0.24 ± 0.03 | 0.18 ± 0.03 | 0.15 ± 0.03 | 0.16 ± 0.03 | 0.17 ± 0.03 | 0.19 ± 0.02 | 0.20 ± 0.03 |
| | CRSVP | 0.21 ± 0.04 | 0.58 ± 0.03 | 0.54 ± 0.02 | 0.29 ± 0.04 | 0.21 ± 0.04 | 0.21 ± 0.03 | 0.10 ± 0.03 | 0.30 ± 0.03 | 0.20 ± 0.03 |
| | Left Filtering | 0.12 ± 0.04 | 0.10 ± 0.01 | 0.24 ± 0.01 | 0.31 ± 0.01 | 0.47 ± 0.02 | 0.53 ± 0.01 | 0.60 ± 0.02 | 0.18 ± 0.02 | 0.19 ± 0.02 |
| | Right Filtering | 0.31 ± 0.04 | 0.64 ± 0.02 | 0.71 ± 0.01 | 0.43 ± 0.02 | 0.45 ± 0.01 | 0.20 ± 0.02 | 0.20 ± 0.01 | 0.27 ± 0.02 | 0.20 ± 0.02 |
| | Two-Way | 0.19 ± 0.04 | 0.17 ± 0.03 | 0.22 ± 0.03 | 0.59 ± 0.03 | 0.59 ± 0.01 | 0.40 ± 0.03 | 0.21 ± 0.03 | 0.27 ± 0.02 | 0.15 ± 0.03 |
| | Average | 0.21 | 0.34 | 0.39 | 0.36 | 0.37 | 0.30 | 0.26 | 0.24 | 0.19 |
| Role-prompted (gpt-4o-mini) | Vanilla | 0.19 ± 0.06 | 0.30 ± 0.04 | 0.24 ± 0.04 | 0.27 ± 0.04 | 0.27 ± 0.03 | 0.22 ± 0.04 | 0.25 ± 0.04 | 0.19 ± 0.03 | 0.24 ± 0.04 |
| | CRSVP | 0.23 ± 0.04 | 0.27 ± 0.04 | 0.21 ± 0.05 | 0.27 ± 0.04 | 0.26 ± 0.05 | 0.27 ± 0.04 | 0.22 ± 0.05 | 0.27 ± 0.03 | 0.22 ± 0.03 |
| | Left Filtering | 0.12 ± 0.04 | 0.10 ± 0.02 | 0.31 ± 0.01 | 0.36 ± 0.01 | 0.53 ± 0.01 | 0.64 ± 0.02 | 0.60 ± 0.02 | 0.19 ± 0.02 | 0.19 ± 0.02 |
| | Right Filtering | 0.30 ± 0.05 | 0.63 ± 0.02 | 0.59 ± 0.02 | 0.43 ± 0.01 | 0.33 ± 0.01 | 0.21 ± 0.01 | 0.10 ± 0.04 | 0.28 ± 0.02 | 0.19 ± 0.02 |
| | Two-Way | 0.22 ± 0.02 | 0.18 ± 0.02 | 0.21 ± 0.02 | 0.62 ± 0.02 | 0.59 ± 0.02 | 0.42 ± 0.02 | 0.20 ± 0.02 | 0.29 ± 0.02 | 0.18 ± 0.02 |
| | Average | 0.21 | 0.30 | 0.31 | 0.39 | 0.40 | 0.35 | 0.27 | 0.24 | 0.20 |
| Fine-tuned (Qwen3-1.7B) | Vanilla | × | 0.78 ± 0.03 | 0.82 ± 0.02 | 0.66 ± 0.03 | 0.75 ± 0.02 | 0.34 ± 0.04 | 0.63 ± 0.03 | 0.33 ± 0.03 | 0.62 ± 0.03 |
| | CRSVP | × | 0.69 ± 0.06 | 0.80 ± 0.02 | 0.36 ± 0.05 | 0.38 ± 0.07 | 0.24 ± 0.03 | 0.29 ± 0.06 | 0.32 ± 0.03 | 0.30 ± 0.04 |
| | Left Filtering | × | 0.09 ± 0.01 | 0.10 ± 0.02 | 0.31 ± 0.02 | 0.35 ± 0.01 | 0.52 ± 0.01 | 0.60 ± 0.01 | 0.18 ± 0.02 | 0.19 ± 0.02 |
| | Right Filtering | × | 0.65 ± 0.02 | 0.63 ± 0.02 | 0.44 ± 0.01 | 0.33 ± 0.01 | 0.20 ± 0.01 | 0.10 ± 0.02 | 0.27 ± 0.01 | 0.20 ± 0.02 |
| | Two-Way | × | 0.18 ± 0.02 | 0.20 ± 0.05 | 0.63 ± 0.02 | 0.60 ± 0.02 | 0.40 ± 0.02 | 0.19 ± 0.05 | 0.29 ± 0.02 | 0.19 ± 0.02 |
| | Average | × | 0.48 | 0.51 | 0.48 | 0.48 | 0.33 | 0.36 | 0.28 | 0.30 |
### 5.3 Conformal Agent Error Attribution Evaluation
Our main comparison of error attribution examines the removal rate ([Equation 9](https://arxiv.org/html/2605.06788#S4.E9)) for conformal algorithms and scoring functions across real and synthetic datasets, including trajectories created through two agentic frameworks. [Table 3](https://arxiv.org/html/2605.06788#S5.T3) displays the results using a target coverage of 80%. Results are shown as the mean and standard deviation over 1000 random splits of the calibration and test data.
Our first observation is that the efficacy of conformal algorithms depends on the distribution of errors in the dataset. The real-world Who&When dataset has decisive errors heavily skewed toward early steps in the trajectory (see [Figure 6](https://arxiv.org/html/2605.06788#S4.F6)). This distribution naturally aligns with RF, which can return a short prefix, and accordingly shows the strongest performance. The synthetic datasets demonstrate this behaviour even more clearly. LF is the strongest algorithm on right-dense data, and TWF successfully isolates errors in the middle of trajectories. Conformal error attribution is most successful when the CP algorithm is matched to the data's error distribution. Luckily, this is simple to achieve in practice. Applying CP already requires a calibration dataset from a distribution similar to the test set. Practitioners can visualize the error distribution of their calibration data as in [Figure 6](https://arxiv.org/html/2605.06788#S4.F6), and choose the appropriate filtering algorithm, or automate the selection based on, for example, the mean error location of trajectories, as sketched below.
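One simple way to automate that selection, sketched here with illustrative cutoffs of our own choosing rather than values from the paper, is to compute the mean normalized error position on the calibration set and pick the filtration whose bias matches:

```python
def select_filtration(calib_error_indices, calib_lengths):
    """Pick LF, RF, or TWF from the calibration data's error distribution.
    The 1/3 and 2/3 thresholds are illustrative assumptions."""
    positions = [j / l for j, l in zip(calib_error_indices, calib_lengths)]
    mean_pos = sum(positions) / len(positions)
    if mean_pos < 1 / 3:
        return "RF"   # errors skew early: short prefixes isolate them
    if mean_pos > 2 / 3:
        return "LF"   # errors skew late: short suffixes isolate them
    return "TWF"      # mid-trajectory errors need filtering from both sides
```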
Notably, filtering algorithms are less dependent on the discriminatory power of the scoring function than VCP and CRSVP. While the baseline approaches tend to require the fine-tuned LLM scoring function to achieve high RR, filtering methods are largely insensitive. This means that L/R/TWF can be used with simple, easier-to-design scoring functions as long as they are properly matched to the data's error distribution. This is especially important for real-world data where labeling is expensive; Who&When supplies only 184 labeled datapoints, enough to calibrate and test, but not nearly enough to also fine-tune an LLM. Our filtering algorithms remain more practical in this setting.
### 5.4 Computational Efficiency
Table 4: Inference cost comparison for conformal methods in terms of average NFE of $g$ (fine-tuned LLM) per trajectory with 80% target coverage on GSM8k variants.

| Conformal Method | Left Dense | Mid Dense | Right Dense |
| --- | --- | --- | --- |
| Vanilla | 8.50 | 8.50 | 8.50 |
| CRSVP | 8.50 | 8.50 | 8.50 |
| Left Filtering | 7.50 | 5.64 | 3.70 |
| Right Filtering | 3.08 | 5.16 | 7.09 |
| Two-Way | 8.50 | 8.50 | 8.50 |
As noted throughout [Section 3](https://arxiv.org/html/2605.06788#S3) and [Table 1](https://arxiv.org/html/2605.06788#S3.T1), at inference existing conformal methods including VCP and CRSVP require evaluating the scoring function $g$ on each step in the trajectory. LF and RF need only evaluate steps from one direction until their threshold is crossed, which can be much more efficient. To demonstrate this quantitatively, we count the average NFE of $g$ used by each algorithm when predicting on the GSM8k test set variants. The results in [Table 4](https://arxiv.org/html/2605.06788#S5.T4) confirm that VCP, CRSVP, and TWF use exactly $\ell$ function calls per trajectory. In contrast, LF shows a clear advantage on right-dense error distributions, while RF uses a mere 36% of the calls on left-dense data where most errors occur early. This result emphasizes the cost advantages of selecting an algorithm whose implicit bias matches the distribution observed in calibration data.
### 5.5 Conformal Agent Rollbacks
Table 5: Automated rollback results using CP sets from VCP and LF compared to Top-1 predictions. Datasets are variants of GSM8k, and the scoring function uses the fine-tuned LLM.

| Dataset | Method | Success Rate ↑ | Coverage | Cost ↓ |
| --- | --- | --- | --- | --- |
| Left Dense | Top-1 | 0.76 ± 0.05 | 0.92 ± 0.03 | 0.83 ± 0.02 |
| | VCP | 0.73 ± 0.05 | 0.85 ± 0.03 | 0.79 ± 0.02 |
| | LF | 0.77 ± 0.04 | 0.81 ± 0.05 | 0.90 ± 0.00 |
| Mid Dense | Top-1 | 0.67 ± 0.04 | 0.86 ± 0.04 | 0.56 ± 0.02 |
| | VCP | 0.59 ± 0.05 | 0.95 ± 0.02 | 0.65 ± 0.01 |
| | LF | 0.68 ± 0.05 | 0.81 ± 0.03 | 0.65 ± 0.01 |
| Right Dense | Top-1 | 0.71 ± 0.05 | 0.86 ± 0.04 | 0.50 ± 0.02 |
| | VCP | 0.70 ± 0.05 | 1.00 ± 0.00 | 0.66 ± 0.02 |
| | LF | 0.75 ± 0.04 | 0.82 ± 0.04 | 0.40 ± 0.01 |

As noted in [Section 3.3](https://arxiv.org/html/2605.06788#S3.SS3), CP sets can enable automated rollbacks for failed tasks with high confidence that decisive errors are captured. We evaluate automated rollbacks on the GSM8k test set variants. For each dataset, we randomly sample 100 MACNET trajectories and score them with the fine-tuned LLM scoring function. As a baseline method, we use the single step in the trajectory with the highest likelihood (Top-1) as the restart point. For conformal rollbacks we predict a conformal set using either the VCP or LF algorithm with a target coverage of 80%, and then restart at the first step in the set. In all cases we restart the MACNET agent with a modified prompt containing the previously failed trace to help the system avoid repeating the same mistakes. As shown in [Table 5](https://arxiv.org/html/2605.06788#S5.T5), LF achieves the highest success rate by a small margin, within one standard deviation of baselines. However, LF is the only method which gives control over coverage: Top-1 gives no guarantee, while VCP overcovers in this setting because its sets are not contiguous.
## 6 Conclusion
In this work we designed novel set-prediction algorithms for the agent error attribution task. By accounting for the data's sequential nature, our filtering-based conformal algorithms showed strong ability to precisely isolate decisive errors in agent trajectories within contiguous sets, with much less compute needed at inference time. Our conformal algorithms can be used as one part of an agent improvement pipeline, allowing humans to more efficiently understand and debug errors in their agents, or enabling fully automated rollbacks wherein agents correct their own mistakes. The limitations of our methods include: the requirement of exchangeable data, which could be loosened with standard CP methods to handle distribution shifts [[9](https://arxiv.org/html/2605.06788#bib.bib44)]; the need to match algorithm choice to data distribution, which can easily be done by checking the calibration dataset; and the assumption that failed trajectories have a single decisive error, which can be naturally expanded to the setting of an arbitrary number of agent errors by using a CP guarantee on the recall level over *all* errors [[18](https://arxiv.org/html/2605.06788#bib.bib10)].
## References
- [1] A. N. Angelopoulos, S. Bates, M. Jordan, and J. Malik (2021). Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations.
- [2] A. N. Angelopoulos and S. Bates (2021). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511.
- [3] A. Banerjee, A. Nair, and T. Borogovac (2025). Where did it all go wrong? A hierarchical look into multi-agent error attribution. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling.
- [4] M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. Zaharia, J. E. Gonzalez, and I. Stoica (2025). Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- [5] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- [6] J. C. Cresswell, B. Kumar, Y. Sui, and M. Belbahri (2025). Conformal prediction sets can cause disparate impact. In The Thirteenth International Conference on Learning Representations.
- [7] J. C. Cresswell, Y. Sui, B. Kumar, and N. Vouitsis (2024). Conformal prediction sets improve human decision making. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 9439–9457.
- [8] A. Ghafarollahi and M. J. Buehler (2024). SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Advanced Materials 37(22).
- [9] I. Gibbs and E. Candes (2021). Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, Vol. 34, pp. 1660–1672.
- [10] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang (2024). Large language model based multi-agents: a survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence.
- [11] C. Gupta, A. K. Kuchibhotla, and A. Ramdas (2022). Nested conformal prediction and quantile out-of-bag ensemble methods. Pattern Recognition 127, 108496.
- [12] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1.
- [13] F. d. Hengst, I. Blin, M. Mohammadi, S. I. H. Shah, and T. Younesian (2025). Hierarchical conformal classification. arXiv:2508.13288.
- [14] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber (2024). MetaGPT: meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations.
- [15] D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, and H. Cui (2023). AgentCoder: multi-agent-based code generation with iterative testing and optimisation. arXiv:2312.13010.
- [16] J. Huang, H. Xi, L. Zhang, H. Yao, Y. Qiu, and H. Wei (2024). Conformal prediction for deep classifier via label ranking. In Proceedings of the 41st International Conference on Machine Learning.
- [17] F. Kong, R. Zhang, H. Yin, G. Zhang, X. Zhang, Z. Chen, Z. Zhang, X. Zhang, S. Zhu, and X. Feng (2026). AEGIS: automated error generation and attribution for multi-agent systems. In The Fourteenth International Conference on Learning Representations.
- [18] B. Kuwahara, C. Lin, X. S. Huang, K. K. Leung, J. A. Yapeter, I. Stanevich, F. Perez, and J. C. Cresswell (2025). Document summarization with conformal importance guarantees. In Advances in Neural Information Processing Systems, Vol. 38.
- [19] K. K. Leung, M. Belbahri, Y. Sui, A. Labach, X. Zhang, S. A. Rose, and J. C. Cresswell (2026). Classifying and addressing the diversity of errors in retrieval-augmented generation systems. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3185–3207.
- [20] Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang (2024). A dynamic LLM-powered agent network for task-oriented agent collaboration. In First Conference on Language Modeling.
- [21] T. Mortier, A. Javanmardi, Y. Sale, E. Hüllermeier, and W. Waegeman (2026). Conformal prediction in hierarchical classification with constrained representation complexity. In The 29th International Conference on Artificial Intelligence and Statistics.
- [22] C. Qian, Z. Xie, Y. Wang, W. Liu, K. Zhu, H. Xia, Y. Dang, Z. Du, W. Chen, C. Yang, Z. Liu, and M. Sun (2025). Scaling large language model-based multi-agent collaboration. In The Thirteenth International Conference on Learning Representations.
- \[23\]N\. Razavian and D\. Sontag\(2015\)Temporal convolutional neural networks for diagnosis from lab tests\.arXiv:1511\.07938\.Cited by:[§3\.1\.4](https://arxiv.org/html/2605.06788#S3.SS1.SSS4.p6.4)\.
- \[24\]Y\. Romano, M\. Sesia, and E\. Candès\(2020\)Classification with valid and adaptive coverage\.InAdvances in Neural Information Processing Systems,Vol\.33\.Cited by:[§2\.2](https://arxiv.org/html/2605.06788#S2.SS2.p2.14)\.
- \[25\]B\. L\. Ross, N\. Vouitsis, A\. A\. Ghomi, R\. Hosseinzadeh, J\. Xin, Z\. Liu, Y\. Sui, S\. Hou, K\. K\. Leung, G\. Loaiza\-Ganem, and J\. C\. Cresswell\(2026\)Textual bayes: quantifying prompt uncertainty in LLM\-based systems\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.06788#S1.p2.1)\.
- \[26\]G\. Shafer and V\. Vovk\(2008\)A tutorial on conformal prediction\.\.Journal of Machine Learning Research9\(3\)\.Cited by:[§2\.2](https://arxiv.org/html/2605.06788#S2.SS2.p1.9)\.
- \[27\]V\. Vovk, A\. Gammerman, and G\. Shafer\(2005\)Algorithmic learning in a random world\.Springer\.Cited by:[§2\.2](https://arxiv.org/html/2605.06788#S2.SS2.p1.9)\.
- \[28\]N\. Wu, M\. Gong, L\. Shou, S\. Liang, and D\. Jiang\(2023\)Large language models are diverse role\-players for summarization evaluation\.InNatural Language Processing and Chinese Computing,pp\. 695–707\.External Links:ISBN 978\-3\-031\-44693\-1Cited by:[§C\.1\.2](https://arxiv.org/html/2605.06788#A3.SS1.SSS2.p1.1)\.
- \[29\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\)Qwen3 technical report\.Cited by:[§C\.1\.3](https://arxiv.org/html/2605.06788#A3.SS1.SSS3.Px2.p1.10),[item 3](https://arxiv.org/html/2605.06788#S3.I1.i3.p1.1)\.
- \[30\]Y\. Yu, Z\. Yao, H\. Li, Z\. Deng, Y\. Jiang, Y\. Cao, Z\. Chen, J\. W\. Suchow, Z\. Cui, R\. Liu, Z\. Xu, D\. Zhang, K\. Subbalakshmi, G\. Xiong, Y\. He, J\. Huang, D\. Li, and Q\. Xie\(2024\)FinCon: A Synthesized LLM Multi\-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 137010–137045\.External Links:[Document](https://dx.doi.org/10.52202/079017-4354)Cited by:[§1](https://arxiv.org/html/2605.06788#S1.p1.1)\.
- \[31\]Y\. Yu, M\. Li, S\. Xu, J\. Fu, X\. Hou, F\. Lai, and B\. Wang\(2025\)CORRECT: COndensed eRror RECognition via knowledge Transfer in multi\-agent systems\.arXiv:2509\.24088\.Cited by:[§1](https://arxiv.org/html/2605.06788#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.06788#S2.SS1.p1.1)\.
- \[32\]S\. Zhang, M\. Yin, J\. Zhang, J\. Liu, Z\. Han, J\. Zhang, B\. Li, C\. Wang, H\. Wang, Y\. Chen, and Q\. Wu\(2025\)Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi\-Agent Systems\.InForty\-second International Conference on Machine Learning,Cited by:[§C\.1\.3](https://arxiv.org/html/2605.06788#A3.SS1.SSS3.Px1.p1.1),[§1](https://arxiv.org/html/2605.06788#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.06788#S2.SS1.p1.1),[item 1](https://arxiv.org/html/2605.06788#S3.I1.i1.p1.1),[§3\.3](https://arxiv.org/html/2605.06788#S3.SS3.p2.1),[§4\.1](https://arxiv.org/html/2605.06788#S4.SS1.p1.1),[§5\.2](https://arxiv.org/html/2605.06788#S5.SS2.p1.1)\.
- \[33\]C\. Zhu, S\. Hong, J\. Wu, K\. Chawla, Y\. Tang, Y\. Yin, N\. Wolfe, E\. Babinsky, and D\. Liu\(2025\)RAFFLES: reasoning\-based attribution of faults for LLM systems\.InFirst Workshop on Multi\-Turn Interactions in Large Language Models,Cited by:[§1](https://arxiv.org/html/2605.06788#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.06788#S2.SS1.p1.1),[§3\.3](https://arxiv.org/html/2605.06788#S3.SS3.p2.1)\.
## Appendix A Theorems and Proofs

In this appendix we provide complete proofs of the theorems stated in the main text.

### A.1 Vanilla Conformal Prediction

The theorem showing valid coverage for VCP with a variable number of classes is essentially the same as the standard result for conformal prediction, for instance as presented by [2](https://arxiv.org/html/2605.06788#bib.bib25), since the number of classes per datapoint does not affect exchangeability of the conformal scores. Although not novel, we present the theorem below for completeness. Note that, following tradition in CP, the scores in this algorithm reflect *non-conformity* in that they take higher values when $y$ disagrees with $x$.
###### Theorem (VCP).
Suppose $\{(x_i, y_i^*)\}_{i=1}^{n}$ and $(x_{n+1}, y_{n+1}^*)$ are exchangeable, where the number of possible classes $\ell_i$ varies per datapoint. Define $\hat{q}$ as the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile of the scores $\{S_i\}_{i=1}^{n} = \{S(x_i, y_i^*)\}_{i=1}^{n}$. Then prediction sets constructed as $C(x_{n+1}; \hat{q}) = \{y \in \mathcal{Y}(x_{n+1}) \mid S(x_{n+1}, y) \leq \hat{q}\}$ satisfy $\mathbb{P}[y_{n+1}^* \in C(x_{n+1}; \hat{q})] \geq 1 - \alpha$.
###### Proof.
Since the fixed scoring function $S$ is applied to each element of the exchangeable sequence $(x_1, y_1^*), \dots, (x_{n+1}, y_{n+1}^*)$, the resulting sequence of scalar scores $S_1, \dots, S_{n+1}$ is also exchangeable. We assume for simplicity that the scores are non-degenerate, but ties can be handled by defining $S$ to add a small amount of noise to its output. Let $S_{(1)} \leq S_{(2)} \leq \dots \leq S_{(n)}$ be the order statistics of the calibration scores, so that the empirical quantile $\hat{q}$ can be defined as $S_{(k)}$, where $k = \lceil (n+1)(1-\alpha) \rceil$ (assuming $n \geq \tfrac{1}{\alpha} - 1$ to ensure $k \leq n$). Given the definition of $C(x_{n+1}; \hat{q})$, the event $y_{n+1}^* \in C(x_{n+1}; \hat{q})$ is equivalent to the event $S_{n+1} \leq \hat{q}$, which occurs if and only if $S_{n+1}$ is among the $k$ smallest scores overall. From exchangeability, all orderings are equally likely, so $\mathbb{P}[S_{n+1} \leq \hat{q}] = \frac{k}{n+1}$. Altogether we have

$$\mathbb{P}[y_{n+1}^* \in C(x_{n+1}; \hat{q})] = \frac{k}{n+1} = \frac{\lceil (n+1)(1-\alpha) \rceil}{n+1} \geq \frac{(n+1)(1-\alpha)}{n+1} = 1 - \alpha. \tag{10}$$

∎

We note that the variable number of classes $\ell_i$ per datapoint does not come up in the proof, and is only present in the assumption that such datapoints can still be exchangeable. This is certainly possible when we treat $\ell_i$ as a (random) property of $x$. If desired, one could treat $\ell_i$ as an explicit random variable and draw points as $(x_i, \ell_i, y_i^*) \sim \mathbb{P}$. Another way of viewing this extension would be to set a fixed $\ell$ as an upper bound on possible sizes, such as $\ell = \max\{\ell_i\}$. Then ordinary conformal prediction can be performed with a model $f$ defined over $\ell$ classes but with $f_j = 0$ whenever $j > \ell_i$, as long as we assume $\ell_{n+1} \leq \ell$.
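To make the calibration and prediction steps concrete, the following is a minimal sketch of VCP with a variable number of classes per datapoint. The function names are ours, and we assume nonconformity scores have already been computed by some model; this is an illustration, not the released implementation.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    assert k <= n, "requires n >= 1/alpha - 1 calibration points"
    return np.sort(cal_scores)[k - 1]  # the k-th smallest score

def vcp_prediction_set(test_scores, q_hat):
    """All classes whose nonconformity score falls below the threshold.

    The number of classes len(test_scores) may vary per datapoint;
    as the proof above shows, this does not affect coverage.
    """
    return [y for y, s in enumerate(test_scores) if s <= q_hat]
```

For example, with $n = 19$ calibration points and $\alpha = 0.1$, $k = \lceil 20 \cdot 0.9 \rceil = 18$, so $\hat{q}$ is the 18th smallest calibration score.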
### A.2 Left (Right) Filtration

Here we provide full formal proofs for each of the steps in [Section 3.1.3](https://arxiv.org/html/2605.06788#S3.SS1.SSS3). For ease of reference, we repeat key definitions and assumptions. A trajectory $x$ is a sequence of a variable number $\ell$ of steps $(c_1, \dots, c_\ell)$, and the decisive error $y^* = c_{j^*}$ is the step at index $j^*$. The scoring function $g_{\text{LF}}$ scores subintervals $c_{j:k} \subseteq x$, with the properties $g_{\text{LF}}(\varnothing) = 0$ and $g_{\text{LF}}(x) = \infty$. The filtering function is defined as $F_{\text{LF}}(x; q) := \operatorname{arg\,max}_{c_{j:\ell} \in \mathcal{I}_{\text{LF}}} \big( |c_{j:\ell}| \mid g_{\text{LF}}(c_{j:\ell}) \leq q \big)$, which optimizes over the suffixes $\mathcal{I}_{\text{LF}} := \{c_{j:\ell}\}_{j=1}^{\ell+1}$ and satisfies $F_{\text{LF}}(x; \infty) = x$. Meanwhile, the conformal score function is defined as $S_{\text{LF}}(x, y^*) := \inf\big( q \in \mathbb{R}^+ \mid y^* \in F_{\text{LF}}(x; q) \big)$. Both of these functions optimize over a set of steps, and it will be convenient to refer to those sets as follows:

$$\mathcal{V}_q := \{c_{j:\ell} \in \mathcal{I}_{\text{LF}} \mid g_{\text{LF}}(c_{j:\ell}) \leq q\}, \tag{11}$$
$$\mathcal{Q} := \{q \in \mathbb{R}^+ \mid y^* \in F_{\text{LF}}(x; q)\}. \tag{12}$$
###### Proof.
$\mathcal{V}_q$ is the set of valid suffixes for $F_{\text{LF}}(x; q)$. For every $c_{j:\ell} \in \mathcal{V}_{q_1}$ we have $g_{\text{LF}}(c_{j:\ell}) \leq q_1 \leq q_2$, and so $c_{j:\ell} \in \mathcal{V}_{q_2}$. Hence, $\mathcal{V}_{q_1} \subseteq \mathcal{V}_{q_2}$. $F_{\text{LF}}(x; q)$ returns the longest suffix in $\mathcal{V}_q$, so we also have $|F_{\text{LF}}(x; q_1)| \leq |F_{\text{LF}}(x; q_2)|$. Since the suffixes $c_{j:\ell} \in \mathcal{I}_{\text{LF}}$ are nested (i.e., $j_2 < j_1 \iff c_{j_1:\ell} \subset c_{j_2:\ell}$), we also have $|c_{j_1:\ell}| \leq |c_{j_2:\ell}| \implies c_{j_1:\ell} \subseteq c_{j_2:\ell}$, which gives the desired nesting property of $F_{\text{LF}}(x; q)$. ∎
###### Proof.
$\mathcal{Q}$ is the set of valid thresholds for $S_{\text{LF}}(x, y^*)$. Let $q^* = g_{\text{LF}}(c_{j^*:\ell})$, and note that $c_{j^*:\ell} \in \mathcal{V}_{q^*}$. Since $|c_{j^*:\ell}| < |c_{j:\ell}|$ only when $j < j^*$, and $y^* \in c_{j:\ell}$ for these cases, the longest suffix in $\mathcal{V}_{q^*}$ contains $y^*$. Hence, we have that $y^* \in F_{\text{LF}}(x; q^*)$, and $q^* \in \mathcal{Q}$.

Now consider any $q' < q^*$. Notice that $c_{j^*:\ell} \notin \mathcal{V}_{q'}$. By monotonicity of $g_{\text{LF}}$, for all $j < j^*$ we have $g_{\text{LF}}(c_{j:\ell}) \geq q^* > q'$. Therefore, the longest suffix in $\mathcal{V}_{q'}$ is shorter than $c_{j^*:\ell}$, and does not contain $y^*$. Hence, $q' \notin \mathcal{Q}$.

Since $q^* \in \mathcal{Q}$, but all $q' < q^*$ are not, $q^*$ is the minimum of $\mathcal{Q}$, and $S_{\text{LF}}(x, y^*) = q^* = g_{\text{LF}}(c_{j^*:\ell})$. ∎
###### Proof.
To begin, we show that the infimum in $S_{\text{LF}}(x, y^*)$ is always achieved. Note that the set of suffix scores $g_{\text{LF}}(c_{j:\ell})$ is finite. Considering only the valid suffixes that contain $y^*$, let $q_{\min} = \min\{g_{\text{LF}}(c_{j:\ell}) \mid j \leq j^*\}$. Due to the nesting of suffixes, the condition $y^* \in F_{\text{LF}}(x; q)$ is satisfied if and only if there exists a valid suffix with score $\leq q$, which occurs precisely when $q_{\min} \leq q$. Consequently, $\mathcal{Q}$ is the closed interval $[q_{\min}, \infty)$, and the infimum in $S_{\text{LF}}(x, y^*)$ is achieved at $q_{\min}$.

Hence we have a chain of equivalences:

$$S_{\text{LF}}(x, y^*) \leq \hat{q} \iff q_{\min} \leq \hat{q} \iff \hat{q} \in \mathcal{Q} \iff y^* \in F_{\text{LF}}(x; \hat{q}). \tag{13}$$

∎
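The following is a minimal sketch of left filtration under the definitions above, assuming per-step scores and a set-level scorer `g` that is monotone under inclusion with $g(\varnothing) = 0$; the names `left_filter` and `s_lf` are ours, not the released implementation.

```python
def left_filter(step_scores, g, q):
    """Longest suffix c_{j:l} with g(c_{j:l}) <= q; returns (j, suffix).

    `g` maps a list of step scores to a set-level score, assumed monotone
    under set inclusion with g([]) = 0; per the definitions, g on the full
    trajectory is treated as infinite, so j = 0 succeeds only when q = inf.
    """
    for j in range(len(step_scores) + 1):  # longest suffix first
        suffix = step_scores[j:]
        if g(suffix) <= q:
            return j, suffix
    raise AssertionError("unreachable: the empty suffix has score 0")

def s_lf(step_scores, j_star, g):
    """Conformal score S_LF(x, y*); by the proof above this reduces to
    g evaluated on the suffix starting at the decisive error j*."""
    return g(step_scores[j_star:])
```

Right filtration is the mirror image, scanning prefixes $c_{1:k}$ from longest to shortest.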
### A.3 Two-Way Filtration

Next we cover the proofs of statements in [Section 3.1.4](https://arxiv.org/html/2605.06788#S3.SS1.SSS4). Continuing from the definitions and results above, including their counterparts for RF, for TWF we had $F_{\text{TWF}}(x; q) := F_{\text{LF}}(x; q) \cap F_{\text{RF}}(x; q)$ (where $F_{\text{TWF}}(x; 0) = \varnothing$ and $F_{\text{TWF}}(x; \infty) = x$), and $S_{\text{TWF}}(x, y^*) := \inf\big( q \in \mathbb{R}^+ \mid y^* \in F_{\text{TWF}}(x; q) \big)$.
###### Proof.
From [Equation 3](https://arxiv.org/html/2605.06788#S3.E3) we have $F_{\text{LF}}(x; q_1) \subseteq F_{\text{LF}}(x; q_2)$, and the equivalent statement for $F_{\text{RF}}$ holds. The intersection operation preserves set inclusion: for any sets $A, B, C, D$, if $A \subseteq B$ and $C \subseteq D$, then $A \cap C \subseteq B \cap D$. Hence,

$$F_{\text{TWF}}(x; q_1) = F_{\text{LF}}(x; q_1) \cap F_{\text{RF}}(x; q_1) \subseteq F_{\text{LF}}(x; q_2) \cap F_{\text{RF}}(x; q_2) = F_{\text{TWF}}(x; q_2). \tag{14}$$

∎
###### Proof.
By the definition of $F_{\text{TWF}}$, $y^* \in F_{\text{TWF}}(x; q)$ if and only if $y^* \in F_{\text{LF}}(x; q)$ and $y^* \in F_{\text{RF}}(x; q)$. The condition $y^* \in F_{\text{LF}}(x; q)$ holds if and only if $q \geq S_{\text{LF}}(x, y^*)$, by the definition of $F_{\text{LF}}(x; q)$ and its nesting property from [Equation 3](https://arxiv.org/html/2605.06788#S3.E3). Similarly, $y^* \in F_{\text{RF}}(x; q)$ if and only if $q \geq S_{\text{RF}}(x, y^*)$. Hence, $y^* \in F_{\text{TWF}}(x; q)$ if and only if $q \geq \max\big( S_{\text{LF}}(x, y^*), S_{\text{RF}}(x, y^*) \big)$. By taking the infimum for $S_{\text{TWF}}(x, y^*)$, we reach [Equation 7](https://arxiv.org/html/2605.06788#S3.E7).

From [Equation 4](https://arxiv.org/html/2605.06788#S3.E4), when $g_{\text{LF}}$ obeys its monotonicity condition, we have $S_{\text{LF}}(x, y^*) = g_{\text{LF}}(c_{j^*:\ell})$, and by a similar argument $S_{\text{RF}}(x, y^*) = g_{\text{RF}}(c_{1:j^*})$. Substituting these results into [Equation 7](https://arxiv.org/html/2605.06788#S3.E7) gives [Equation 8](https://arxiv.org/html/2605.06788#S3.E8). ∎
###### Proof.
From the definition of $F_{\text{TWF}}(x; q)$ as an intersection, the event $y^* \in F_{\text{TWF}}(x; \hat{q})$ occurs if and only if both $y^* \in F_{\text{LF}}(x; \hat{q})$ and $y^* \in F_{\text{RF}}(x; \hat{q})$ occur. We have already shown that $y^* \in F_{\text{LF}}(x; \hat{q}) \iff S_{\text{LF}}(x, y^*) \leq \hat{q}$ ([Figure 3](https://arxiv.org/html/2605.06788#S3.F3)), and similarly for RF. Hence,

$$y^* \in F_{\text{TWF}}(x; \hat{q}) \iff S_{\text{LF}}(x, y^*) \leq \hat{q} \;\wedge\; S_{\text{RF}}(x, y^*) \leq \hat{q}. \tag{15}$$

Equivalently, we write $\max\big( S_{\text{LF}}(x, y^*), S_{\text{RF}}(x, y^*) \big) \leq \hat{q}$, which is the same as $S_{\text{TWF}}(x, y^*) \leq \hat{q}$ ([Figure 4](https://arxiv.org/html/2605.06788#S3.F4)). ∎
###### Theorem (TWF).
Suppose $\{(x_i, y_i^*)\}_{i=1}^{n}$ and $(x_{n+1}, y_{n+1}^*)$ are exchangeable. Given scoring functions $g_{\text{LF}}$ and $g_{\text{RF}}$ as defined above, define the conformal score $S_{\text{TWF}}(x_i, y_i^*)$ as in [Equation 6](https://arxiv.org/html/2605.06788#S3.E6). Let $\hat{q}$ be the $\tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile of conformal scores $\{S_i\}_{i=1}^{n} = \{S_{\text{TWF}}(x_i, y_i^*)\}_{i=1}^{n}$. Then prediction sets constructed as $C_{\text{TWF}}(x_{n+1}; \hat{q}) = F_{\text{TWF}}(x_{n+1}; \hat{q})$ satisfy $1 - \alpha \leq \mathbb{P}[y_{n+1}^* \in C_{\text{TWF}}(x_{n+1}; \hat{q})] < 1 - \alpha + \frac{1}{n+1}$.
###### Proof.
The proof proceeds in the same way as [Figure 3](https://arxiv.org/html/2605.06788#S3.F3), using [Figure 4](https://arxiv.org/html/2605.06788#S3.F4) in place of [Figure 3](https://arxiv.org/html/2605.06788#S3.F3). ∎

As a final point, we note that prediction sets constructed as $C_{\text{TWF}}(x_{n+1}; \hat{q}) = F_{\text{TWF}}(x_{n+1}; \hat{q})$ may be empty due to the use of intersection. While this is not strictly a concern, as coverage remains valid marginally, it can be undesirable to predict empty sets. As a fallback, TWF can predict the single most likely step under $g_{\text{LF}}$ or $g_{\text{RF}}$ (which are likely to be the same function in practice); the sketch below implements this fallback.
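Putting the pieces together, here is a sketch of TWF, reusing `left_filter` from the sketch in Section A.2; all names are illustrative rather than the released API.

```python
def right_filter(step_scores, g, q):
    """Length k of the longest prefix c_{1:k} with g(c_{1:k}) <= q."""
    for k in range(len(step_scores), -1, -1):  # longest prefix first
        if g(step_scores[:k]) <= q:
            return k
    raise AssertionError("unreachable: the empty prefix has score 0")

def twf_prediction_set(step_scores, g_lf, g_rf, q):
    """Intersection of the LF suffix and RF prefix: a contiguous interval."""
    j, _ = left_filter(step_scores, g_lf, q)
    k = right_filter(step_scores, g_rf, q)
    if j >= k:  # empty intersection: fall back to the most likely single step
        return [max(range(len(step_scores)), key=lambda i: step_scores[i])]
    return list(range(j, k))

def s_twf(step_scores, j_star, g_lf, g_rf):
    """Conformal score for TWF (Equation 8): the max of the LF and RF scores,
    where the RF score uses the prefix c_{1:j*}, inclusive of step j*."""
    return max(g_lf(step_scores[j_star:]), g_rf(step_scores[:j_star + 1]))
```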
## Appendix B Additional Details of Conformal Algorithms

In [Section 3.1.2](https://arxiv.org/html/2605.06788#S3.SS1.SSS2) we described how to apply the CRSVP algorithm of Mortier et al. [21](https://arxiv.org/html/2605.06788#bib.bib4), originally designed for hierarchical classification, to sequential data like agent trajectories. Here we provide a more detailed description of our implementation.
To recap, a binary tree $\mathcal{T}$ is constructed from an agent trajectory $x = (c_1, \dots, c_\ell)$ with root node $v_1 = [\ell]$ and leaf nodes $v_\ell, \dots, v_{2\ell-1}$ corresponding to the individual steps $c_1, \dots, c_\ell$. For each calibration datapoint $(x, y^*)$, a scoring function $g_{\text{CRSVP}}$ is used to find the single leaf $\bar{v}$ most likely to contain the error $y^*$. If $\bar{v}$ is itself the decisive error $y^*$, then the conformal score is returned as $S_{\text{CRSVP}}(x, y^*) := g_{\text{CRSVP}}(\bar{v})$. Otherwise, the tree is traversed upwards until the first node $v^*$ is reached which does contain $y^*$. Since the number of steps in each node grows exponentially as the tree is traversed, simply using the score $g_{\text{CRSVP}}(v^*)$ would greatly overcover the true labels. Hence, CRSVP randomly interpolates between the scores of successive nodes, similar to other conformal scoring functions [1](https://arxiv.org/html/2605.06788#bib.bib13). Let $v^{*-1}$ be the node before $v^*$ on the traversal path. Then the conformal score is set as

$$S_{\text{CRSVP}}(x, y^*) := g_{\text{CRSVP}}(v^*) - u \cdot \big( g_{\text{CRSVP}}(v^*) - g_{\text{CRSVP}}(v^{*-1}) \big), \tag{16}$$

where $u \sim \mathcal{U}(0, 1)$ is noise drawn uniformly from $[0, 1]$. Note that parent nodes are strict supersets of their children, which means prediction sets are nested when traversing from leaf to root [11](https://arxiv.org/html/2605.06788#bib.bib24).

After conformal calibration to obtain the $\frac{\lfloor (n+1)\alpha \rfloor}{n}$ quantile of the scores, which we denote $\hat{q}$ (the quantile differs from VCP because CRSVP uses conformity scores, whereas VCP uses $1 - g$ as a nonconformity score; these are merely conventions and can easily be modified), for a test datapoint $x_{n+1}$, CRSVP again finds the most likely leaf node $\bar{v}$ according to $g_{\text{CRSVP}}$. If that node's score is below the threshold, $g_{\text{CRSVP}}(\bar{v}) < \hat{q}$, CRSVP recursively traverses up the tree until the threshold is surpassed at $\hat{v}$. Drawing noise $u$, if $g_{\text{CRSVP}}(\hat{v}) - u \cdot \big( g_{\text{CRSVP}}(\hat{v}) - g_{\text{CRSVP}}(\hat{v}^{-1}) \big) \geq \hat{q}$, then $\hat{v}^{-1}$ is returned; otherwise $\hat{v}$ itself is. By construction, all prediction sets generated this way are contiguous subintervals of $x_{n+1}$, and the lower bound on coverage is guaranteed ([Equation 1](https://arxiv.org/html/2605.06788#S2.E1)) [21](https://arxiv.org/html/2605.06788#bib.bib4).
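As an illustration of the calibration-side scoring, the sketch below assumes the binary tree is built by recursively halving the step interval, and that a node's score $g_{\text{CRSVP}}$ is the sum of the step probabilities it covers; both choices are ours for concreteness and may differ from the released implementation.

```python
import random

def path_to_root(ell, leaf):
    """Nested intervals [lo, hi) from the argmax leaf up to the root of a
    binary tree built by recursive halving over ell trajectory steps."""
    path, lo, hi = [], 0, ell
    while hi - lo > 1:
        path.append((lo, hi))
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if leaf < mid else (mid, hi)
    path.append((lo, hi))       # the leaf itself
    return path[::-1]           # leaf-to-root order

def crsvp_score(step_probs, j_star, rng=random):
    """Calibration score of Equation 16 under the assumptions above."""
    leaf = max(range(len(step_probs)), key=lambda i: step_probs[i])
    g = lambda node: sum(step_probs[node[0]:node[1]])  # illustrative node score
    path = path_to_root(len(step_probs), leaf)
    if path[0][0] == j_star:
        return g(path[0])  # the argmax leaf is itself the decisive error
    prev = path[0]
    for node in path[1:]:
        if node[0] <= j_star < node[1]:  # first ancestor containing the error
            u = rng.random()             # u ~ U(0, 1)
            return g(node) - u * (g(node) - g(prev))
        prev = node
    raise AssertionError("unreachable: the root contains every step")
```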
## Appendix C Implementation Details and Additional Results

### C.1 Additional Experiment Details

In this section we provide more detail about datasets, models, and experimental results.
#### C.1.1 Synthetic Dataset Generation

As discussed in [Section 4.1](https://arxiv.org/html/2605.06788#S4.SS1), we construct three variants of the original GSM8k dataset that explicitly control the location of the decisive error step along the execution trajectory. Errors are generated by selecting one step of a trajectory uniformly at random and injecting instructions for the agent to fail on that step in various ways [17](https://arxiv.org/html/2605.06788#bib.bib27), [4](https://arxiv.org/html/2605.06788#bib.bib26). We generate 1,200 failed trajectories on GSM8k for each of the DyLAN and MACNET agent frameworks (and similarly for the MATH dataset). The trajectories are then split by error position, in the first, middle, or final third of the trajectory, to form the Right-, Mid-, and Left-Dense GSM8k datasets; a minimal sketch of this bucketing is given below.
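This sketch assumes each generated trajectory records its injected error index; the field names are illustrative, not those of the released code.

```python
def split_by_error_position(trajectories):
    """Bucket failed trajectories by the tercile of the injected error index."""
    buckets = {"first_third": [], "middle_third": [], "final_third": []}
    for traj in trajectories:
        frac = traj["error_idx"] / len(traj["steps"])
        if frac < 1 / 3:
            buckets["first_third"].append(traj)
        elif frac < 2 / 3:
            buckets["middle_third"].append(traj)
        else:
            buckets["final_third"].append(traj)
    return buckets
```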
#### C.1.2 Context-Engineered LLM Implementation

Prior work has shown that engineering the context of LLMs and prompting them with diverse role-specific instructions can improve evaluation robustness [28](https://arxiv.org/html/2605.06788#bib.bib45). Following this general idea, we employ multiple role prompts, inspired by ECHO [3](https://arxiv.org/html/2605.06788#bib.bib32), to obtain step-level scores from different perspectives. We first use the LLM to summarize why the agent failed the task using the overall trace, then pass this information to LLMs with four different roles to generate scores. The average of these scores is the final step score. The exact prompts used for the context-engineered LLM can be found in [Section C.3](https://arxiv.org/html/2605.06788#A3.SS3).
#### C.1.3 Fine-tuned Scoring Function

As described in [Section 3.2](https://arxiv.org/html/2605.06788#S3.SS2), we fine-tuned an LLM to estimate step-level likelihoods of being the decisive error step.

##### Training data

Due to limited data in Who&When [32](https://arxiv.org/html/2605.06788#bib.bib1), we only perform fine-tuning on the MATH and GSM8k datasets with synthetically injected errors. As mentioned in [Section C.1.1](https://arxiv.org/html/2605.06788#A3.SS1.SSS1), we generated 4,800 trajectories in total across datasets and MAS architectures. Of these, 4,000 were used for model fine-tuning, equally split across the datasets, architectures, and error locations. We then used an 85%/15% split for training and validation to support hyperparameter selection. The remaining 800 trajectories were held out entirely from fine-tuning and instead used for conformal calibration and testing, with an even split. Throughout this work, we report model performance on this unseen testing split.
##### Model Training

We fine-tuned a Qwen3-1.7B [29](https://arxiv.org/html/2605.06788#bib.bib39) LLM for 20 epochs to estimate step-wise error likelihoods in failed multi-agent trajectories. Since each trajectory has one step labeled as the decisive error (determined through error injection), we fine-tuned the model $f_\theta$ on the $\ell$-way classification task to produce a scalar score $f_\theta(c_{i,j})$ for each step $j$ in datapoint $i$, interpreted as the likelihood that step $c_{i,j}$ contains the decisive error. We write $e_{i,j}$ for the binary indicator of whether step $j$ in datapoint $i$ is the decisive error. The model parameters $\theta$ were optimized via binary cross-entropy:

$$\mathcal{L}(\theta) = -\sum_i \sum_{j=1}^{\ell_i} \Big[ e_{i,j} \log \sigma\big( f_\theta(c_{i,j}) \big) + (1 - e_{i,j}) \log\big( 1 - \sigma\big( f_\theta(c_{i,j}) \big) \big) \Big], \tag{17}$$

where $\sigma(\cdot)$ denotes the logistic sigmoid.
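In code, the inner sum of Equation 17 for a single trajectory is ordinary binary cross-entropy over its steps, with exactly one positive label. A minimal PyTorch sketch, assuming the fine-tuned model emits one scalar logit per step (the names here are ours):

```python
import torch
import torch.nn.functional as F

def step_error_loss(step_logits, error_idx):
    """Per-trajectory term of Equation 17.

    step_logits: tensor of shape (ell,) holding f_theta(c_j) for each step j.
    error_idx:   index of the injected decisive error, so e_j = 1 only there.
    """
    targets = torch.zeros_like(step_logits)
    targets[error_idx] = 1.0
    return F.binary_cross_entropy_with_logits(step_logits, targets, reduction="sum")
```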
Table 6: Fine-tuning configuration

| Model | Dataset | Epochs | Batch Size | Precision | LoRA $(r, \alpha)$ | LR |
|---|---|---|---|---|---|---|
| Qwen3-1.7B | GSM8k + MATH | 18 | 32 | 8-bit | (2, 8) | $2 \times 10^{-6}$ |

We provide more training setup details in [Table 6](https://arxiv.org/html/2605.06788#A3.T6), and report model performance before and after fine-tuning in [Table 7](https://arxiv.org/html/2605.06788#A3.T7).
##### Compute Resources

For fine-tuning the 1.7B-parameter LLM we used an Nvidia GeForce RTX 5090. Twenty epochs of training took roughly one day and consumed less than 16 GB of GPU memory with batch size 32. All other calls to LLMs used commercial APIs, and we discussed the number of calls needed in [Section 5.4](https://arxiv.org/html/2605.06788#S5.SS4). Once scoring calls are made, performing conformal calibration is a trivial computational cost.
Table 7: Performance of Qwen3-1.7B models before and after fine-tuning

| Model | AUROC | AUPRC | Accuracy |
|---|---|---|---|
| Pretrained | 0.505 | 0.369 | 0.509 |
| Fine-tuned | 0.762 | 0.382 | 0.731 |
#### C.1.4 Rollback Experiment Implementation

Figure 8: Example rollback scenario with text for the task, decisive error, final answer, and corrected versions after the rollback. We first generate a conformal set for the failed task, roll back the state of the MAS to the first step in the prediction set, and restart the agent with instructions to avoid the same mistakes.

[Figure 8](https://arxiv.org/html/2605.06788#A3.F8) provides a concrete example of the rollback procedure. After identifying a conformal set for a failed trajectory using the LF algorithm, the system rolls back to the earliest step in the set and re-executes the task from that point, with additional context from the failed trace, allowing the MAS to correct the final outcome. Prompt details used during the restart are provided in [Section C.3](https://arxiv.org/html/2605.06788#A3.SS3).
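A hedged sketch of this loop, reusing `left_filter` from Appendix A; the `mas.reset_to` and `mas.run` interfaces are placeholders for whatever state and execution API the MAS exposes, not the paper's released code.

```python
def conformal_rollback(mas, failed_trace, step_scores, g_lf, q_hat):
    """Roll back to the earliest step of the LF prediction set and retry."""
    start, _ = left_filter(step_scores, g_lf, q_hat)  # set = steps [start, ell)
    trusted = failed_trace.steps[:start]   # kept as mostly correct history
    suspect = failed_trace.steps[start:]   # flagged as containing the error
    mas.reset_to(start)
    # Restart with instructions to avoid repeating the flagged mistakes.
    return mas.run(task=failed_trace.task,
                   context={"trusted": trusted, "avoid": suspect})
```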
### C.2 Additional Experimental Results

#### C.2.1 Scoring Aggregation Methods

In [Section 3.2](https://arxiv.org/html/2605.06788#S3.SS2) we discussed ways to tailor LLMs to the error attribution task and generate a step-wise error likelihood score. Step-wise scores then need to be aggregated into a set-wise score, ideally one that obeys monotonicity ([Equation 2](https://arxiv.org/html/2605.06788#S3.E2)). In that section we demonstrated the normalized sum function for aggregation,

$$g(c_{j:k}) = \frac{1}{\ell} \sum_{i=j}^{k} \text{LLM}(c_i),$$

which was used for experiments in the main text. Here we consider alternatives and show empirical results across them.
##### Max Aggregation

Another way to monotonically aggregate step-level scores is to take the maximum score over a set:

$$g_{\text{max}}(c_{j:k}) = \max_{i \in \{j, \dots, k\}} \text{LLM}(c_i) + \lambda \frac{k - j}{\ell}. \tag{18}$$

Note that, whereas summation includes length normalization to account for trajectories of different lengths $\ell$, we do not normalize the max, since the single-step max tends to be similar across trajectories of different lengths. Instead, we include an optional length penalty term, weighted by the hyperparameter $\lambda$, to penalize sets that are longer than necessary to contain the highest-scoring single step. Below, we also test adding a similar length penalty to the sum aggregation. Max aggregation should be useful when the LLM produces very peaked step-wise scores, whereas sum aggregation can benefit when step-wise scores are more uniform.
##### LogSumExp Aggregation

As a third alternative, we consider a family of monotonic aggregations that interpolate between the extremes of sum and max by using the LogSumExp function:

$$g_{\text{LSE}}(c_{j:k}) = \frac{1}{\beta} \text{LogSumExp}\big( \beta \ell \cdot [\text{LLM}(c_j), \dots, \text{LLM}(c_k)] \big) - \log \ell. \tag{19}$$

This aggregation includes an inverse-temperature hyperparameter $\beta$: higher values behave more like the max function, while $\beta$ close to zero behaves like the mean function (but remains monotonic, unlike the true mean). We also add length normalization via $\ell$ to make scores more comparable across trajectories of different lengths. A sketch of all three aggregators follows below.
##### Experimental Results

In [Table 8](https://arxiv.org/html/2605.06788#A3.T8), we compare Max, LogSumExp, and Sum aggregation with the optional length penalty, using the RF conformal algorithm on the Left-Dense subset of GSM8k, across three evaluator settings (Naive LLM, Role-prompted, and the fine-tuned Qwen3-1.7B) and two agent architectures (DyLAN and MACNET). The primary result is that RF is largely insensitive to the details of the scoring function $g$, from the LLM design, to the aggregation function, to the various hyperparameters. This means a simpler scoring function design is sufficient for use with our filtering algorithms.
Table 8: Removal rate with the Right Filtering conformal algorithm across various aggregation functions on the GSM8k Left-Dense dataset.

| Aggregation | Hypers | DyLAN: Naive LLM | DyLAN: Role-prompted | DyLAN: Qwen3-1.7B | MACNET: Naive LLM | MACNET: Role-prompted | MACNET: Qwen3-1.7B |
|---|---|---|---|---|---|---|---|
| Sum + Length Penalty | $\lambda=0.01$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.01 | 0.62±0.02 |
| | $\lambda=0.02$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.01 | 0.62±0.02 |
| | $\lambda=0.05$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.01 | 0.62±0.02 |
| | $\lambda=0.1$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.01 | 0.62±0.01 |
| Max + Length Penalty | $\lambda=0.01$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.02 | 0.62±0.02 |
| | $\lambda=0.02$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.01 | 0.62±0.02 |
| | $\lambda=0.05$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.02 | 0.62±0.01 |
| | $\lambda=0.1$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.02 | 0.62±0.01 |
| Normalized LogSumExp | $\beta=0.1$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.02 | 0.63±0.02 |
| | $\beta=1$ | 0.64±0.02 | 0.63±0.02 | 0.65±0.01 | 0.59±0.01 | 0.59±0.02 | 0.63±0.02 |
| | $\beta=10$ | 0.63±0.02 | 0.63±0.02 | 0.64±0.02 | 0.60±0.01 | 0.59±0.01 | 0.62±0.01 |
### C.3 Prompts Used

##### Naive LLM Prompts

We use the following prompt for the naive LLM scoring function.

```
You are an AI assistant tasked with analyzing a segment of a multi-agent conversation. Multiple agents are collaborating to address a user query, with the goal of resolving the query through their collective dialogue.

Your primary task is to identify the location of the most critical mistake within the provided segment.

The problem to address is as follows: {problem}
The final correct answer for the problem is: {answer}
The wrong answer given at the end of the conversation is: {wrong_answer}
Review the following conversation segment: {evaluate_content}
Cut from the entire history: {chat_segment_content}

Based on your analysis, please respond with ONLY a single probability value between 0 and 1, representing the estimated probability that the critical error occurs within this conversation segment. Do not include any explanation or additional text.
```
##### Context-Engineered Prompts

As mentioned earlier, the context-engineered setting uses four role-based LLMs as part of the scoring function. Here we provide the role instructions used for each LLM.

```
Conservative:
- Attribute an error only when an explicit contradiction or logical error is visible.
- Assign high confidence only if the evidence is direct and quotes are exact.

Liberal:
- Consider potential upstream or downstream causes even if indirect.
- Assign moderate confidence (0.5--0.7) for plausible but not proven causes.

Skeptical:
- Always include at least one alternative explanation and note missing evidence.
- Lower confidence if evidence is incomplete.

Pattern:
- Focus on repeated reasoning or coordination issues.
- Identify patterns across multiple steps rather than isolated mistakes.
```
We force the output into the following schema:

```json
{
  "investigation_summary": "short summary",
  "primary_conclusion": {
    "agent": "Agent-X",
    "mistake_step": "int",
    "confidence": "float (0--1)",
    "evidence": ["quoted spans or step references"],
    "reason": "brief reason",
    "alternative_explanations": ["..."]
  }
}
```
The final prompt for each role is as follows:

```
Problem to solve: {problem}
Expected correct answer: {answer}
Conversation segment under review: {evaluate_content}
Full chat context: {chat_segment_content}

ROLE: {role} Analyst
{role_instructions}

TASK: Identify whether this segment likely contains the key reasoning error that led to the wrong final answer. Provide your analysis strictly in JSON format matching the schema below. Quote evidence spans exactly from the text and assign a numeric confidence between 0 and 1.

Confidence mapping: 0.0--0.2 = Very unlikely, 0.2--0.4 = Unlikely, 0.4--0.6 = Possible, 0.6--0.8 = Likely, 0.8--1.0 = Very likely.

Return ONLY valid JSON (no extra text): {schema}
```
##### Conformal Rollback Prompts

The conformal rollback instructions provided to the MAS are as follows:

```
NOTE: The following conversation history (Node 0 to Node {cut_index}) contains wrong information. Please avoid making the same mistakes if you detect the mistake: {wrong_conversation_content}

With the following mostly correct conversation history: {prev_context}

Solve this problem step by step: {task_query}
```