Counterfactual Graph for Multi-Agent LLM Calibration
Summary
This paper introduces CAGE, a counterfactual graph-based method for calibrating multi-agent LLM systems, evaluating on benchmarks like TriviaQA and MMLU-Pro across various communication topologies. The method outperforms existing post-hoc and LLM-elicited calibration approaches.
# Counterfactual Graph for Multi-Agent LLM Calibration
Source: [https://arxiv.org/html/2605.30653](https://arxiv.org/html/2605.30653)
### 6\.1 Experimental Setup
We use the 25\-cell grid from Section [3](https://arxiv.org/html/2605.30653#S3), formed by five benchmarks and five communication topologies\. The benchmarks are TriviaQA \(Joshi et al\., [2017](https://arxiv.org/html/2605.30653#bib.bib44)\), TruthfulQA \(Lin et al\., [2022b](https://arxiv.org/html/2605.30653#bib.bib45)\), MMLU\-Pro \(Wang et al\., [2024](https://arxiv.org/html/2605.30653#bib.bib46)\), GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.30653#bib.bib47)\), and BIG\-Bench Hard \(BBH; Suzgun et al\., [2023](https://arxiv.org/html/2605.30653#bib.bib48)\)\. The topologies are iid, debate, chain, hub\-spoke, and tree\. Each topology\-benchmark cell is evaluated with three rollouts\. For each query, the panel first produces a plurality answer, and the calibrator then scores the reliability of that answer\. Metrics are averaged over topologies and rollouts within each benchmark\. The Mean column macro\-averages the five benchmarks\. For CAGE\-Select, we evaluate only matched test groups where all five topology outputs are available for the same query\. Full experimental details are provided in Appendix [D](https://arxiv.org/html/2605.30653#A4)\.
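
To make the per-query protocol concrete, the following minimal sketch computes a plurality answer and its plurality share for one panel. The function name `plurality`, the example answers, and the tie-breaking rule (first answer encountered wins) are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def plurality(answers):
    """Return (plurality_answer, plurality_share) for one panel of agent answers.

    Assumption of this sketch: ties are broken in favour of the answer
    that appears first in the list.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical five-agent panel.
panel = ["Paris", "Paris", "Lyon", "Paris", "Marseille"]
answer, share = plurality(panel)
print(answer, share)
```

The plurality share is the scalar agreement signal that the post-hoc baselines calibrate; the calibrator described in the paper scores the same plurality answer with additional structural information.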
### 6\.2 Baselines and Metrics
We compare CAGE\-Cal against three categories of baselines, each addressing a different evaluation dimension:

- Post\-hoc plurality calibrators\. These calibrate the confidence score computed from plurality share \(Kuncheva, [2004](https://arxiv.org/html/2605.30653#bib.bib20)\) on the validation split using three post\-hoc methods: \(1\) Platt scaling \(Platt et al\., [1999](https://arxiv.org/html/2605.30653#bib.bib12)\), \(2\) isotonic regression \(Zadrozny and Elkan, [2002](https://arxiv.org/html/2605.30653#bib.bib17)\), and \(3\) scaling\-binning \(Kumar et al\., [2019](https://arxiv.org/html/2605.30653#bib.bib15)\)\.
- LLM\-elicited confidence estimators\. These baselines use an LLM judge to elicit a probability\-like confidence score for the panel prediction\. They test whether panel reliability can be inferred from the final agent responses, optionally augmented with the topology description\. The baselines include: \(1\) LLM\-Cal without topology information, \(2\) LLM\-Cal with topology information, and \(3\) Collaborative Calibration \(Yang et al\., [2024](https://arxiv.org/html/2605.30653#bib.bib13)\)\.
- Trained calibrators\. \(1\) Scalar \+ GBT \(Ke et al\., [2017](https://arxiv.org/html/2605.30653#bib.bib19)\) uses vote, confidence, and graph\-summary features without relational encoding; \(2\) GraphCal \(Li et al\., [2025](https://arxiv.org/html/2605.30653#bib.bib10)\) adapts graph\-based calibration to the panel setting; and \(3\) DiscoUQ\-LLM \(Jiang, [2026](https://arxiv.org/html/2605.30653#bib.bib51)\) serves as a strong baseline based on disagreement features\.

We also evaluate four ranking\-only UQ scores: answer entropy \(Kuhn et al\., [2023](https://arxiv.org/html/2605.30653#bib.bib28)\), average log probability \(Kadavath et al\., [2022](https://arxiv.org/html/2605.30653#bib.bib25)\), DiverseAgentEntropy \(Feng et al\., [2025](https://arxiv.org/html/2605.30653#bib.bib26)\), and MATU \(Chen et al\., [2026](https://arxiv.org/html/2605.30653#bib.bib27)\)\. These methods provide uncertainty scores rather than calibrated probabilities, so we report AUROC and AUARC without interpreting the scores probabilistically\.
##### Metrics\.
Our primary metrics are ECE and AUROC\. ECE measures whether predicted confidence matches empirical correctness, while AUROC measures whether correct panel answers receive higher scores than incorrect ones\. We additionally report Brier score and AUARC as complementary metrics\. Brier score evaluates probability quality under a proper scoring rule, and AUARC evaluates selective prediction when panel answers are accepted in decreasing confidence order\. For ranking\-only UQ scores, we omit ECE unless a probability scale is introduced through the shared validation\-set calibration protocol\.
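
As a rough illustration of the two probability-quality metrics, the sketch below computes an equal-width binned ECE and the Brier score on a fraction scale (the paper reports values in percent). The choice of 10 equal-width bins, the function names, and the toy data are assumptions of this sketch; the paper's exact binning protocol is not stated in this section.

```python
def brier_score(conf, correct):
    """Mean squared difference between confidence and the 0/1 correctness label."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of bin weight * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # a confidence of 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical confidences and correctness labels for five panel answers.
conf = [0.9, 0.9, 0.6, 0.6, 0.3]
correct = [1, 1, 1, 0, 0]
print(round(expected_calibration_error(conf, correct), 3))
print(round(brier_score(conf, correct), 3))
```

ECE depends on the binning scheme while the Brier score does not, which is one reason to report both.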
Figure 3: Mean Brier score by method family \(lower is better\)\. Within each family, bars are sorted worst → best \(light → dark\)\. CAGE\-Cal \(rightmost\) has the lowest Brier overall\.

### 6\.3 Main Calibration Results
Table [6](https://arxiv.org/html/2605.30653#S6) reports the headline in\-distribution results\. The first block shows that post\-hoc calibration can substantially reduce the ECE of plurality share, from 18\.04 to 6\.98 with scaling\-binning\. However, these methods still rely on the same scalar agreement signal\. Platt scaling is monotonic and therefore preserves the AUROC of plurality share at 72\.80; isotonic regression and scaling\-binning only change ranking slightly through ties and bins\. This confirms that post\-hoc calibration can improve probability scale, but cannot add the structural information needed to distinguish independent agreement from communication\-induced consensus\.

LLM\-elicited confidence estimators are also limited in this setting\. Their mean ECE remains high, ranging from 22\.52 to 23\.78, and adding topology information to the prompt brings only a small AUROC change\. This suggests that a topology label alone is not enough for a judge model to recover the query\-specific dependence among agents\.

Among trained calibrators, CAGE\-Cal achieves the best mean ECE and the strongest mean AUROC\. Compared with DiscoUQ\-LLM, the strongest prior trained baseline, CAGE\-Cal reduces mean ECE from 7\.08 to 5\.56 and improves mean AUROC from 73\.46 to 83\.61\. The AUROC gain is especially large on MMLU\-Pro and BBH, where correlated agreement is more harmful: CAGE\-Cal improves AUROC by 13\.69 points on MMLU\-Pro and 24\.55 points on BBH over DiscoUQ\-LLM\. These gains support the central claim that panel confidence should depend not only on how many agents agree, but also on how that agreement was formed\.

Figure [3](https://arxiv.org/html/2605.30653#S6.F3) provides a complementary view through Brier score\. CAGE\-Cal achieves the lowest mean Brier score at 11\.2, outperforming DiscoUQ\-LLM \(14\.5\), Scalar \+ GBT \(15\.6\), and GraphCal \(19\.2\)\. Thus, the improvement is not only an artifact of ECE binning; CAGE\-Cal also produces sharper and better\-scaled probability estimates\.
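
The observation that Platt scaling cannot change AUROC follows from its form: a sigmoid applied to a positively scaled score is strictly increasing, so the ranking of panel answers is untouched. The sketch below checks this on toy data. The slope `a=4.0` and offset `b=-3.0` are illustrative values rather than fitted parameters, and the plurality shares and labels are hypothetical.

```python
import math


def auroc(scores, labels):
    """Probability that a correct answer outranks an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def platt(score, a, b):
    """Platt-style map: sigmoid(a * score + b); strictly increasing when a > 0."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical plurality shares and correctness labels.
shares = [1.0, 0.8, 0.8, 0.6, 0.6, 0.4]
labels = [1, 1, 0, 1, 0, 0]
calibrated = [platt(s, a=4.0, b=-3.0) for s in shares]

print(auroc(shares, labels) == auroc(calibrated, labels))
```

The calibrated values differ from the raw shares, so ECE and Brier score can change, but every pairwise ordering is preserved.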
### 6\.4 Ranking\-Only UQ Comparison
Figure 4: AUROC and AUARC of CAGE\-Cal vs\. heuristic UQ baselines\. Each axis reports the benchmark\-level mean over 3 rollouts and 5 topologies\. Per\-benchmark numbers are in Appendix Table [7](https://arxiv.org/html/2605.30653#A4.T7)\.

Figure [4](https://arxiv.org/html/2605.30653#S6.F4) compares CAGE\-Cal with ranking\-only UQ baselines using AUROC and AUARC\. These methods provide useful uncertainty scores, but they do not define a calibrated probability scale\. Plurality vote, entropy, average log probability, DiverseAgentEntropy, and MATU all summarize the panel with scalar signals\. Such signals work when disagreement reflects uncertainty, but they miss the difference between benign diversity and false consensus\. CAGE\-Cal gives the strongest overall reliability ranking across the five benchmarks\. The advantage is most visible on BBH and MMLU\-Pro, where scalar disagreement signals are less reliable\. This aligns with the failure\-mode analysis: when agents become correlated through shared model families or communication paths, the reliability of the final answer depends on the dependency structure behind the vote, not only on the vote distribution itself\.
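
AUARC needs only a ranking, which is why it can be reported for scores that are not probabilities. The sketch below uses one common discretization, assumed here rather than taken from the paper: accept answers in decreasing score order and average the accuracy of the accepted set over all coverage levels. The function name and toy data are hypothetical.

```python
def auarc(scores, correct):
    """Average accuracy of the accepted set as answers are accepted in decreasing score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        total += hits / k  # accuracy among the k most confident answers
    return total / len(scores)


# Hypothetical correctness labels scored by a good and by a poor ranking.
correct = [1, 1, 0, 1]
good_scores = [0.9, 0.8, 0.1, 0.7]  # the single error is ranked last
poor_scores = [0.6, 0.7, 0.8, 0.9]  # the single error is ranked second

print(auarc(good_scores, correct) > auarc(poor_scores, correct))
```

Rescaling the scores with any strictly increasing function leaves the result unchanged, so AUARC says nothing about whether the scores are calibrated.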
### 6\.5 Confidence\-Routed Topology Selection
We further test whether calibrated confidence can serve as a control signal for multi\-agent inference\. Instead of committing to a single topology for all queries, CAGE\-Select uses CAGE\-Cal confidence to choose which topology output should be trusted for each query\. Figure [5](https://arxiv.org/html/2605.30653#S6.F5) shows that the best fixed topology reaches 65\.18% mean accuracy, while simple routing rules based on plurality share or mean log probability do not improve over it\. CAGE\-Select reaches 67\.23%, a \+2\.05 point gain\. Thus, CAGE\-Cal confidence is not only calibrated within a topology, but also comparable across topologies, making it useful as a control signal for multi\-agent inference\.
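
One simple way to read confidence-routed selection is as a per-query argmax over topologies. The sketch below assumes exactly that rule; the paper's actual selection procedure may differ, and the function name, data layout, and confidence values are hypothetical.

```python
def select_topology(candidates):
    """Pick the topology whose panel answer has the highest calibrated confidence.

    candidates maps topology name -> (plurality_answer, calibrated_confidence).
    Assumption of this sketch: routing is a plain argmax over confidence.
    """
    topology = max(candidates, key=lambda name: candidates[name][1])
    answer, confidence = candidates[topology]
    return topology, answer, confidence


# Hypothetical outputs of the five topologies for a single query.
query_outputs = {
    "iid": ("B", 0.58),
    "debate": ("A", 0.71),
    "chain": ("A", 0.44),
    "hub-spoke": ("B", 0.62),
    "tree": ("A", 0.66),
}
print(select_topology(query_outputs))
```

Such a rule only helps if confidence values are comparable across topologies, which is the property the section argues CAGE-Cal provides.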
Figure 5: Mean accuracy of routing strategies\. The dashed line marks the per\-benchmark best fixed topology \(65\.18\)\. The per\-benchmark breakdown is in Appendix Table [E\.2](https://arxiv.org/html/2605.30653#A5.SS2)\.

## 7 Analysis
| Method | TriviaQA ECE ↓ | TriviaQA AUROC ↑ | TruthfulQA ECE ↓ | TruthfulQA AUROC ↑ | MMLU\-Pro ECE ↓ | MMLU\-Pro AUROC ↑ | GSM8K ECE ↓ | GSM8K AUROC ↑ | BBH ECE ↓ | BBH AUROC ↑ | Mean ECE ↓ | Mean AUROC ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scalar \+ GBT | 11\.15 ± 4\.43 | 80\.28 ± 4\.95 | 23\.50 ± 12\.68 | 70\.46 ± 2\.14 | 16\.39 ± 4\.04 | 60\.73 ± 4\.28 | 10\.14 ± 3\.84 | 80\.58 ± 2\.49 | 13\.33 ± 4\.22 | 62\.48 ± 2\.54 | 14\.90 ± 1\.73 | 70\.91 ± 2\.54 |
| GraphCal | 11\.93 ± 4\.66 | 80\.13 ± 5\.01 | 34\.15 ± 9\.54 | 63\.51 ± 2\.89 | 14\.41 ± 2\.17 | 54\.79 ± 5\.86 | 19\.59 ± 6\.43 | 80\.92 ± 6\.54 | 20\.05 ± 4\.09 | 70\.62 ± 3\.60 | 20\.03 ± 1\.87 | 70\.00 ± 2\.40 |
| DiscoUQ\-LLM | 12\.52 ± 5\.24 | 82\.60 ± 4\.90 | 24\.42 ± 13\.93 | 71\.78 ± 3\.72 | 16\.83 ± 4\.67 | 59\.37 ± 7\.95 | 10\.09 ± 4\.12 | 84\.59 ± 3\.95 | 15\.25 ± 4\.40 | 63\.00 ± 3\.58 | 15\.82 ± 8\.45 | 72\.27 ± 11\.32 |
| CAGE\-Cal \(ours\) | 4\.63 ± 0\.71 | 85\.74 ± 1\.41 | 8\.70 ± 1\.84 | 79\.59 ± 4\.67 | 11\.19 ± 1\.29 | 76\.67 ± 3\.63 | 1\.89 ± 1\.31 | 80\.39 ± 5\.75 | 6\.38 ± 1\.38 | 88\.67 ± 1\.12 | 6\.56 ± 0\.68 | 82\.21 ± 1\.84 |

Table 2: Leave\-one\-topology\-out \(LOTO\) generalization\. ECE \(↓\) and AUROC \(↑\), mean over the 5 held\-out\-topology folds\. Trained calibrators only\.

| Variant | ECE ↓ | AUROC ↑ | AUARC ↑ |
| --- | --- | --- | --- |
| Scalar\-summary baselines | | | |
| Scalar summaries \+ LR | 8\.03 ± 0\.35 | 74\.12 ± 1\.24 | 71\.64 ± 0\.69 |
| Scalar summaries \+ GBT | 7\.99 ± 0\.95 | 74\.78 ± 1\.32 | 72\.70 ± 0\.50 |
| CAGE\-Cal incremental variants | | | |
| Observed graph encoder only | 6\.97 ± 0\.28 | 81\.25 ± 0\.88 | 75\.61 ± 0\.13 |
| \+ Iid counterfactual tower | 6\.78 ± 0\.33 | 81\.89 ± 1\.47 | 75\.79 ± 0\.24 |
| \+ Group\-level hyperedge stream | 6\.75 ± 0\.37 | 82\.56 ± 1\.27 | 76\.08 ± 0\.38 |
| \+ Calibration\-aware objective \(full\) | 5\.56 ± 0\.03 | 83\.61 ± 1\.34 | 76\.47 ± 0\.37 |
| Δ vs\. base | −1\.41 | \+2\.36 | \+0\.86 |

Table 3: Component ablation of CAGE\-Cal, averaged over 25 in\-distribution cells \(percent, mean ± std over 3 rollouts\)\. Δ rows show absolute changes from the previous variant in percentage points\. LR and GBT denote logistic regression and gradient\-boosted trees\.

### 7\.1 Component Ablation
Table [3](https://arxiv.org/html/2605.30653#S7) ablates the main components of CAGE\-Cal\. Scalar\-summary baselines compress each panel into fixed statistics and remain far below graph\-based variants, with the stronger GBT head reaching only 74\.78 AUROC and 72\.70 AUARC\. The observed graph encoder raises AUROC to 81\.25, showing that panel reliability depends on relational structure beyond aggregate vote and confidence summaries\. Adding the IID counterfactual tower further improves AUROC by 0\.64 points, while the group\-level hyperedge stream adds another 0\.67 points by capturing shared family, role, answer\-cluster, and exposure effects\. The calibration\-aware objective gives the largest ECE reduction, from 6\.75 to 5\.56, without sacrificing ranking quality\. Overall, the gains come from modeling a counterfactual dependence shift, not from simply adding a stronger prediction head\.
### 7\.2 Generalization to Held\-Out Topologies
We test topology generalization with leave\-one\-topology\-out evaluation\. Each fold removes one topology from both training and validation, and evaluates the calibrator on that unseen topology\. As shown in Table [2](https://arxiv.org/html/2605.30653#S7.T2), CAGE\-Cal remains stable under this shift\. Its mean AUROC is 82\.21, close to the in\-distribution result of 83\.61, and its mean ECE rises only from 5\.56 to 6\.56\. By comparison, DiscoUQ\-LLM reaches 72\.27 mean AUROC and 15\.82 mean ECE\. This indicates that scalar disagreement features do not transfer as well when the communication structure changes\. The result supports the relational design of CAGE\-Cal\. Rather than relying on a topology label, it uses communication edges, local failure correlations, answer clusters, and group\-level dependency units\. These features are defined for any topology, allowing the calibration rule to transfer to unseen communication structures\.
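
The leave-one-topology-out protocol can be sketched as a fold generator. The record layout (a dict with a `topology` key) and the function name `loto_folds` are assumptions for illustration; only the five topology names and the rule that the held-out topology is absent from both training and validation come from the text.

```python
TOPOLOGIES = ["iid", "debate", "chain", "hub-spoke", "tree"]


def loto_folds(records):
    """Yield (held_out, train_val, test) for each leave-one-topology-out fold.

    The held-out topology is removed from the pool used for training and
    validation, and is used only for testing.
    """
    for held_out in TOPOLOGIES:
        train_val = [r for r in records if r["topology"] != held_out]
        test = [r for r in records if r["topology"] == held_out]
        yield held_out, train_val, test


# Hypothetical records: two queries per topology.
records = [{"topology": t, "query_id": q} for t in TOPOLOGIES for q in range(2)]
folds = list(loto_folds(records))
print([(name, len(train_val), len(test)) for name, train_val, test in folds])
```

Splitting `train_val` into training and validation parts is left out; the point is that no example from the held-out topology reaches either part.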
### 7\.3 Correcting the Two Failure Modes
Figure [6](https://arxiv.org/html/2605.30653#S7.F6) tests whether CAGE\-Cal corrects the two calibration failures identified earlier\. In Mode A, iid/TriviaQA, plurality share is under\-confident because wrong answers are dispersed across weak clusters\. In Mode B, chain/TruthfulQA, plurality share is over\-confident because communication can concentrate agents on the same wrong answer\. In both cases, CAGE\-Cal moves the reliability curve closer to the perfect calibration line\. Thus, the same vote share can receive different confidence depending on how the agreement or disagreement was formed\.
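
The direction of miscalibration in the two modes can be summarized by a signed gap between accuracy and confidence. The sketch below is only an illustration with made-up numbers: `calibration_gap` is a hypothetical helper, and the two toy panel sets are not data from the paper.

```python
def calibration_gap(conf, correct):
    """Mean accuracy minus mean confidence: > 0 is under-confident, < 0 is over-confident."""
    return sum(correct) / len(correct) - sum(conf) / len(conf)


# Hypothetical Mode A-like set: low plurality share, yet mostly correct.
mode_a = calibration_gap([0.4, 0.4, 0.6, 0.4], [1, 1, 1, 0])
# Hypothetical Mode B-like set: near-unanimous panels that are often wrong.
mode_b = calibration_gap([1.0, 0.8, 1.0, 0.8], [1, 0, 0, 1])

print(round(mode_a, 2), round(mode_b, 2))
```

A reliability curve applies the same comparison within each confidence bin, which is what the figure plots.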
Figure 6: Failure\-mode correction\. Bars show panel counts, and curves show empirical bin accuracy when examples are binned by plurality share or by CAGE\-Cal confidence\. CAGE\-Cal reduces under\-confidence in Mode A and over\-confidence in Mode B\.

## 8 Conclusion
We investigate confidence calibration in multi\-agent LLM systems and show that agreement alone is an unreliable confidence signal\. Our analysis across different benchmarks and communication topologies identifies two recurring failure modes: DUC and COC\. We propose CAGE\-Cal, a counterfactual agent\-graph calibration framework that contrasts post\-communication graphs with IID counterfactual graphs to separate independent evidence from correlated failure, which improves reliability discrimination with competitive calibration error\. We further introduce CAGE\-Select, which uses calibrated confidence to dynamically select the most reliable topology and improve final panel accuracy\. Overall, our results highlight the importance of topology\-aware calibration for reliable multi\-agent LLM systems\.
## 9 Limitations
A limitation is that agent participation is not optimized\. We construct panels from a predefined agent pool, while real systems may benefit from query\-adaptive agent selection, where the system decides which agents are most useful for a given query\. Another limitation is that communication is not jointly optimized with calibration\. We evaluate a fixed set of candidate communication patterns, but practical systems may need adaptive communication routing, where agents decide which interactions are most useful based on intermediate answers and uncertainty\. Extending CAGE\-Cal to jointly support adaptive agent selection and communication routing is an important future direction \(Li et al\., [2026a](https://arxiv.org/html/2605.30653#bib.bib72); Zhang et al\., [2025c](https://arxiv.org/html/2605.30653#bib.bib69); Shi et al\., [2026](https://arxiv.org/html/2605.30653#bib.bib68); Huang et al\., [2026](https://arxiv.org/html/2605.30653#bib.bib70)\)\.
## References
- Rewarding doubt: a reinforcement learning approach to calibrated confidence expression of large language models\. In ICLR\.
- T\. Chen, H\. Yao, J\. Chen, E\. E\. Papalexakis, and H\. Wei \(2026\) Every response counts: quantifying uncertainty of llm\-based multi\-agent systems through tensor decomposition\. In ACL\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. arXiv\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2024\) Improving factuality and reasoning in language models through multiagent debate\. In ICML\.
- Y\. Feng, P\. M\. Htut, Z\. Qi, W\. Xiao, M\. Mager, N\. Pappas, K\. Halder, Y\. Li, Y\. Benajiba, and D\. Roth \(2025\) Rethinking llm uncertainty: a multi\-agent approach to estimating black\-box model uncertainty\. In EMNLP\.
- Y\. Geifman and R\. El\-Yaniv \(2017\) Selective classification for deep neural networks\. In NeurIPS\.
- Google \(2025\) Gemma\-3\-12b\-it\. Note: [https://huggingface\.co/google/gemma\-3\-12b\-it](https://huggingface.co/google/gemma-3-12b-it)
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In ICML\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber \(2024\) MetaGPT: meta programming for a multi\-agent collaborative framework\. In ICLR\.
- J\. Huang, Z\. Zhang, K\. Shi, Y\. Ye, and C\. Zhang \(2026\) Evolverouter: co\-evolving routing and prompt for multi\-agent question answering\. arXiv\.
- B\. Jiang \(2026\) DiscoUQ: structured disagreement analysis for uncertainty quantification in llm agent ensembles\. arXiv\.
- M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\. In ACL\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan \(2022\) Language models \(mostly\) know what they know\. arXiv\.
- G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\) Lightgbm: a highly efficient gradient boosting decision tree\. In NeurIPS\.
- T\. N\. Kipf and M\. Welling \(2017\) Semi\-supervised classification with graph convolutional networks\. In ICLR\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. In ICLR\.
- A\. Kumar, P\. S\. Liang, and T\. Ma \(2019\) Verified uncertainty calibration\. In NeurIPS\.
- L\. I\. Kuncheva \(2004\) Combining pattern classifiers: methods and algorithms\. Wiley\.
- Y\. Li, S\. Wang, L\. Huang, and L\. Liu \(2025\) Graph\-based confidence calibration for large language models\. TMLR\.
- Z\. Li, J\. Huang, X\. Guo, G\. Wang, and C\. Zhang \(2026a\) Same signal, opposite meaning: direction\-informed adaptive learning for llm agents\. arXiv\.
- Z\. Li, X\. Wu, Z\. Wang, J\. Li, Y\. Tian, J\. Bi, Y\. Ma, Y\. Ye, and C\. Zhang \(2026b\) Graph is a substrate across data modalities\. In ICML\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022a\) Teaching models to express their uncertainty in words\. TMLR\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022b\) TruthfulQA: measuring how models mimic human falsehoods\. In ACL\.
- MetaAI \(2024\) Llama\-3\-8b\-instruct\. Note: [https://huggingface\.co/meta\-llama/Llama\-3\.1\-8B\-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- Microsoft \(2024\) Phi\-4\. Note: [https://huggingface\.co/microsoft/phi\-4](https://huggingface.co/microsoft/phi-4)
- J\. Platt et al\. \(1999\) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods\. Advances in large margin classifiers\.
- D\. Qiao, B\. Chen, F\. Cai, J\. Chen, W\. Li, F\. Jiang, Z\. Chen, H\. Zha, T\. Zhang, and B\. Wang \(2026\) Epistemic gain, aleatoric cost: uncertainty decomposition in multi\-agent debate for math reasoning\. arXiv\.
- Qwen Alibaba \(2025\) Qwen3\-8b\. Note: [https://huggingface\.co/Qwen/Qwen3\-8B](https://huggingface.co/Qwen/Qwen3-8B)
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-bert: sentence embeddings using siamese bert\-networks\. In EMNLP\.
- K\. Shi, Z\. Zhang, Z\. Yuan, K\. Murugesan, V\. Galassi, C\. Zhang, and Y\. Ye \(2026\) NG\-router: graph\-supervised multi\-agent collaboration for nutrition question answering\. In EACL\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. V\. Le, E\. H\. Chi, D\. Zhou, and J\. Wei \(2023\) Challenging big\-bench tasks and whether chain\-of\-thought can solve them\. In ACL\.
- L\. Wang, W\. Xu, Y\. Lan, Z\. Hu, Y\. Lan, R\. K\. Lee, and E\. Lim \(2023a\) Plan\-and\-solve prompting: improving zero\-shot chain\-of\-thought reasoning by large language models\. In ACL\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023b\) Self\-consistency improves chain of thought reasoning in language models\. In ICLR\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\) MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\. In NeurIPS\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In NeurIPS\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2024\) AutoGen: enabling next\-gen llm applications via multi\-agent conversation\. In COLM\.
- R\. Yang, D\. Rajagopal, S\. A\. Hayati, B\. Hu, and D\. Kang \(2024\) Confidence calibration and rationalization for llms via multi\-agent deliberation\. arXiv\.
- M\. Yasunaga, X\. Chen, Y\. Li, P\. Pasupat, J\. Leskovec, P\. Liang, E\. H\. Chi, and D\. Zhou \(2024\) Large language models as analogical reasoners\. In ICLR\.
- B\. Zadrozny and C\. Elkan \(2002\) Transforming classifier scores into accurate multiclass probability estimates\. In KDD\.
- G\. Zhang, Y\. Yue, Z\. Li, S\. Yun, G\. Wan, K\. Wang, D\. Cheng, J\. X\. Yu, and T\. Chen \(2025a\) Cut the crap: an economical communication pipeline for llm\-based multi\-agent systems\. In ICLR\.
- G\. Zhang, Y\. Yue, X\. Sun, G\. Wan, M\. Yu, J\. Fang, K\. Wang, T\. Chen, and D\. Cheng \(2025b\) G\-designer: architecting multi\-agent communication topologies via graph neural networks\. In ICML\.
- J\. Zhang, C\. Xiong, and C\. Wu \(2026\) Agentic confidence calibration\. In ICML\.
- Z\. Zhang, K\. Shi, Z\. Yuan, Z\. Wang, T\. Ma, K\. Murugesan, V\. Galassi, C\. Zhang, and Y\. Ye \(2025c\) AgentRouter: a knowledge\-graph\-guided llm router for collaborative multi\-agent question answering\. arXiv\.
- H\. S\. Zheng, S\. Mishra, X\. Chen, H\. Cheng, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2024\) Take a step back: evoking reasoning via abstraction in large language models\. In ICLR\.
- Z\. Zhou, T\. Jin, J\. Shi, and Q\. Li \(2025\) Steerconf: steering llms for confidence elicitation\. In NeurIPS\.
- X\. Zhu, C\. Zhang, Y\. Chi, T\. Stafford, N\. Collier, and A\. Vlachos \(2026\) Demystifying multi\-agent debate: the role of confidence and diversity\. arXiv\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024\) Language agents as optimizable graphs\. In ICML\.
## Appendix A Additional Related Work
##### Graph\-based multi\-agent design and calibration under dependence\.
A related line of work represents multi\-agent systems as graphs and optimizes their communication structure\. GPTSwarm \(Zhuge et al\., [2024](https://arxiv.org/html/2605.30653#bib.bib29)\) treats language agents as optimizable graphs, G\-Designer \(Zhang et al\., [2025b](https://arxiv.org/html/2605.30653#bib.bib30)\) learns communication topologies with graph neural networks, and AgentPrune \(Zhang et al\., [2025a](https://arxiv.org/html/2605.30653#bib.bib31)\) removes unnecessary agents or communication links for efficiency\. These methods focus mainly on task accuracy, routing, or computational cost\. They do not ask how a given topology changes the panel’s joint failure distribution, nor do they produce calibrated confidence for the panel answer\. Our work turns graph structure into a calibration signal: given a topology and the resulting panel outputs, we estimate whether the panel’s agreement reflects independent evidence or correlated failure\.
## Appendix B Failure\-Mode Analysis and Counterfactual Motivation
### B\.1 Why Calibration Needs Counterfactual Graph Shifts
In this section, we first explain why confidence calibration requires both the iid graph $G_{x}^{0}$ and the post\-communication graph $G_{x}^{T}$\. We then provide empirical evidence that the shift between $G_{x}^{0}$ and $G_{x}^{T}$ matters for confidence calibration\.
#### B\.1\.1 Why Calibration Needs the IID and Post\-Communication Graphs
Figure 7: Sources of agent error correlation\. OLS coefficients on predictors of pair correlation $\mathbf{W}_{ij}$\. Communication structure dominates the backbone, which dominates the prompting role\.

| Predictor | $\beta$ \(pp\) | Partial $R^{2}$ \(%\) |
| --- | --- | --- |
| Intercept \(baseline iid pair\) | 35\.40 | n/a |
| Population\-side sources | | |
| Same backbone family | \+10\.36 | 0\.00 |
| Same prompting role | \+0\.89 | 0\.01 |
| Topology\-induced sources \(vs\. iid baseline\) | | |
| is\_chain | \+17\.87 | 9\.37 |
| is\_debate | \+11\.76 | 4\.29 |
| is\_tree | \+6\.76 | 1\.13 |
| is\_hub\_spoke | \+1\.15 | 0\.04 |
| Full\-model $R^{2}$ | 29\.02% | |
| $n$ pairs | 4,325 | |

Table 4: Decomposition of agent error correlation $\mathbf{W}_{ij}$\. OLS on 4,325 agent pairs from the 25 cells\. $\beta$ is the coefficient on the 0/1 predictor in percentage points; partial $R^{2}$ is the variance lost when dropping that predictor\.

Section [3](https://arxiv.org/html/2605.30653#S3) defines $W(x)$ as the local dependency structure of a panel\. We now ask what creates this dependency\. If agents provided independent evidence, the off\-diagonal entries of $W(x)$ would be small and weakly structured\. Instead, we find two systematic sources of correlation: population\-side similarity among agents and communication\-induced coupling from the topology\. This distinction is central to calibration\. A topology\-agnostic score can observe the final vote distribution, but it cannot tell whether the distribution was produced by independent evidence or correlated failure\.
##### Where does failure correlation come from?
For each topology and benchmark cell, we aggregate the off-diagonal entries of $W(x)$ and fit an ordinary least squares model over agent pairs:

$$\mathbf{W}_{ij}^{(c)} = \beta_{0} + \beta_{F}\,\mathbf{1}[\text{same family}] + \beta_{R}\,\mathbf{1}[\text{same role}] + \sum_{t\in\mathcal{T}\setminus\{\text{iid}\}} \beta_{t}\,\mathbf{1}[\text{topology}_{c}=t] + \epsilon.$$

Figure [7](https://arxiv.org/html/2605.30653#A2.F7) visualizes the main coefficients, while the full regression results are reported in Table [4](https://arxiv.org/html/2605.30653#A2.T4). Table [4](https://arxiv.org/html/2605.30653#A2.T4) decomposes pairwise agent failure correlation into population-side and topology-induced sources. The coefficients do not measure final panel accuracy; instead, they measure how much each factor increases the tendency of two agents to succeed or fail together. Shared backbone families increase pairwise correlation, suggesting common model-side blind spots, while communication topologies such as chain and debate further amplify correlated failures. These results support our motivation for modeling both population-side dependence and topology-induced coupling in multi-agent confidence calibration.
The decomposition gives two observations\.
- First, heterogeneous model families matter more than heterogeneous prompts. Agents with the same backbone tend to fail together even when they use different prompting roles. In contrast, shared prompting role contributes little once backbone family is controlled. This suggests that prompt diversity can make a panel appear diverse without fully decorrelating its errors.
- Second, communication topology is a major source of additional dependence. Chain has the largest topology coefficient, followed by debate and tree. This order follows the amount of peer exposure in each topology: later agents in a chain see many previous outputs, debate agents see many peers in one round, and tree agents receive information through local branches. Hub-spoke is close to the iid baseline because spokes do not see one another. Thus, part of the dependence is inherited from the heterogeneous agent population, while another part is induced by communication.
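As a worked reading of the decomposition, the fitted coefficients reported in Table 4 can be summed to give the model's predicted correlation for a particular kind of agent pair. The sketch below only performs that arithmetic; the dictionary and function names are illustrative and do not come from the paper's code, and it assumes the coefficients combine additively as in the OLS model.

```python
# Coefficients copied from Table 4, in percentage points (pp).
INTERCEPT = 35.40  # baseline iid pair, different family, different role
BETA = {
    "same_family": 10.36,
    "same_role": 0.89,
    "chain": 17.87,
    "debate": 11.76,
    "tree": 6.76,
    "hub_spoke": 1.15,
}


def predicted_pair_correlation(same_family, same_role, topology):
    """Fitted pair correlation (pp) under the additive OLS model."""
    value = INTERCEPT
    if same_family:
        value += BETA["same_family"]
    if same_role:
        value += BETA["same_role"]
    if topology != "iid":  # iid is the reference category
        value += BETA[topology]
    return round(value, 2)


iid_cross = predicted_pair_correlation(False, False, "iid")
chain_same_family = predicted_pair_correlation(True, False, "chain")
print(iid_cross, chain_same_family)
```

Under these fitted values, a same-family pair in a chain sits 28.23 pp above a cross-family iid pair, with the larger share of the increase coming from the topology term.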
##### Motivating the Counterfactual Graph Pair
This split directly motivates the graph pair used by CAGE-Cal. A counterfactual iid graph $G_{x}^{0}$ captures the dependence that would exist without communication. It represents population-side structure, including shared model families, shared biases, and shared prompting conventions. A post-communication graph $G_{x}^{T}$ captures the same matched agents after topology $T$ has reshaped their outputs. It contains both population-side dependence and communication-induced dependence. Because $G_{x}^{0}$ and $G_{x}^{T}$ are built over the same agent identities, their contrast gives a query-specific view of how communication changes the panel. This is the information that scalar disagreement scores discard. Section [5](https://arxiv.org/html/2605.30653#S5) turns this counterfactual graph pair into the input representation for CAGE-Cal.
### B.2 Communication-Induced Correctness Shifts
Section [B.1.1](https://arxiv.org/html/2605.30653#A2.SS1.SSS1) shows that communication can increase failure correlation among agents. We now examine the same effect at the agent level. Communication can change an agent’s correctness in both directions. It can help an initially wrong agent become correct, but it can also make an initially correct agent become wrong. Mean accuracy only reflects the net balance of these two effects. Calibration depends on how these shifted votes are distributed in the final panel.
Figure 8: Per-agent regression rate. Fraction of iid-correct agents that become wrong under each topology on the same (question, rollout). Substantial on hard benchmarks even in topologies whose mean accuracy is unchanged.

##### Regression and improvement.
For agent $i$, query $x$, rollout $r$, and topology $\tau$, we call the agent *communication-regressed* when

$$c_{i}^{(\mathrm{iid})}(x,r)=1 \quad\text{and}\quad c_{i}^{(\tau)}(x,r)=0.$$

That is, the agent answers correctly when run independently, but answers incorrectly after communication. We call the symmetric case *communication-improved*, when

$$c_{i}^{(\mathrm{iid})}(x,r)=0 \quad\text{and}\quad c_{i}^{(\tau)}(x,r)=1.$$

For each topology $\tau$ and benchmark $b$, we report both the regression rate $\mathrm{reg}_{\tau,b}$ and the improvement rate $\mathrm{impr}_{\tau,b}$. The iid run provides a matched counterfactual for the same agent on the same query.
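The two rates can be computed directly from matched correctness records. The following is a minimal sketch, assuming correctness is stored as a dictionary keyed by (agent, query, rollout); that data layout and the function name `shift_rates` are hypothetical and chosen only for illustration.

```python
def shift_rates(iid_correct, topo_correct):
    """Regression and improvement rates from matched correctness records.

    Both arguments map (agent, query, rollout) -> 0/1 correctness and must
    share the same keys, so every record has an iid counterfactual.
    """
    n_iid_right = n_iid_wrong = regressed = improved = 0
    for key, c_iid in iid_correct.items():
        c_topo = topo_correct[key]
        if c_iid == 1:
            n_iid_right += 1
            regressed += int(c_topo == 0)  # correct alone, wrong after communication
        else:
            n_iid_wrong += 1
            improved += int(c_topo == 1)   # wrong alone, correct after communication
    reg = regressed / n_iid_right if n_iid_right else 0.0
    impr = improved / n_iid_wrong if n_iid_wrong else 0.0
    return reg, impr


# Toy records for six agents on one (query, rollout).
keys = [(agent, "q0", 0) for agent in "abcdef"]
iid = dict(zip(keys, [1, 1, 0, 0, 1, 1]))
chain = dict(zip(keys, [1, 0, 1, 0, 1, 1]))
reg, impr = shift_rates(iid, chain)
print(reg, impr)
```

The two rates use different denominators (iid-correct agents for regression, iid-wrong agents for improvement), so they do not sum to the net accuracy change.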
##### Communication can help accuracy while hurting calibration\.
Figure [8](https://arxiv.org/html/2605.30653#A2.F8) shows that regression is not rare. On hard benchmarks, a substantial fraction of independently correct agents become wrong after communication. For example, chain on TruthfulQA regresses nearly one third of agents that were correct in the iid setting. Tree shows a similar pattern on TruthfulQA and MMLU-Pro.
At the same time, regression does not necessarily imply lower mean accuracy\. Communication can also improve agents that were initially wrong\. A topology can therefore be helpful for average accuracy while still injecting many wrong votes into the panel\. This distinction is central to calibration\. A corrected vote and a regressed vote both appear as confident votes in the final answer distribution\. Plurality vote cannot tell whether a larger answer cluster was formed by independent correction or by harmful influence\.
The key implication is that communication changes the reliability of votes, not only their average correctness\. Two topologies can have similar mean accuracy but very different calibration behavior, because their correctness shifts land in different answer clusters\. The next section shows how these shifts lead to two opposite calibration failures: diversity\-induced under\-confidence and communication\-induced over\-confidence\.
##### Communication Can Both Correct and Corrupt Agents
| Topology | Benchmark | Reg. rate | Impr. rate | $\Delta$Acc (pp) |
| --- | --- | --- | --- | --- |
| Debate | TriviaQA | 5.44% | 39.55% | +7.44 |
| Debate | TruthfulQA | 25.60% | 11.05% | +3.16 |
| Debate | MMLU-Pro | 19.15% | 20.24% | +4.16 |
| Debate | GSM8K | 5.86% | 54.65% | +5.90 |
| Debate | BBH | 14.99% | 28.68% | +2.78 |
| Chain | TriviaQA | 7.05% | 35.10% | +5.02 |
| Chain | TruthfulQA | 31.73% | 9.49% | +0.62 |
| Chain | MMLU-Pro | 21.52% | 21.78% | +4.10 |
| Chain | GSM8K | 6.12% | 54.53% | +5.66 |
| Chain | BBH | 15.74% | 28.92% | +2.43 |
| Hub-spoke | TriviaQA | 6.31% | 15.86% | +0.04 |
| Hub-spoke | TruthfulQA | 23.12% | 6.04% | -0.24 |
| Hub-spoke | MMLU-Pro | 17.09% | 12.69% | +0.53 |
| Hub-spoke | GSM8K | 6.03% | 24.85% | -0.03 |
| Hub-spoke | BBH | 13.31% | 19.79% | +0.16 |
| Tree | TriviaQA | 5.64% | 31.45% | +5.02 |
| Tree | TruthfulQA | 30.52% | 7.51% | -0.26 |
| Tree | MMLU-Pro | 21.99% | 16.97% | +0.94 |
| Tree | GSM8K | 6.62% | 43.69% | +2.64 |
| Tree | BBH | 17.19% | 24.34% | -0.16 |

Table 5: Communication-induced regression rates. *Reg.* = fraction of iid-correct agents that became wrong under the row’s topology on the same (question, rollout). *Impr.* is the converse. $\Delta$Acc is the net per-agent accuracy change in percentage points.

Table [5](https://arxiv.org/html/2605.30653#A2.T5) analyzes how communication changes individual agent correctness relative to the iid setting. The regression rate measures the fraction of iid-correct agents that become incorrect after communication, while the improvement rate measures the reverse transition from incorrect to correct. Communication can improve many agents, especially on GSM8K, but it can also corrupt initially correct agents, with particularly high regression rates on TruthfulQA. This shows that communication reshapes agent-level failure patterns rather than simply improving accuracy, motivating calibration methods that account for topology-induced dependence.
### B.3 Calibration Error Varies Across Communication Topologies

| Topology | Mean ECE |
| --- | --- |
| debate | 14.59 ± 8.26 |
| tree | 14.82 ± 6.09 |
| hub-spoke | 17.55 ± 8.02 |
| iid | 18.00 ± 8.60 |
| chain | 19.48 ± 16.31 |

Table 6: Per-topology mean majority-vote ECE (%, mean ± std across benchmarks). Chain is worst-calibrated and benchmark-variant (benign on TriviaQA, catastrophic on TruthfulQA).

Table [6](https://arxiv.org/html/2605.30653#A2.T6) reports the majority-vote ECE for each communication topology, averaged across benchmarks. Unlike plurality accuracy, which remains relatively stable across topologies, calibration error varies more substantially. In particular, chain has the highest mean ECE and the largest variance, suggesting that communication structure can make vote-based confidence unreliable even when final-answer accuracy changes little. This motivates our focus on calibrating panel confidence rather than only improving the predicted answer.
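For readers who want to reproduce a vote-based ECE of this kind, the sketch below computes a standard equal-width binned ECE from vote shares and correctness labels. The binning scheme and the choice of 10 bins are assumptions; this section does not state which ECE variant was used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of (|B|/n) * |acc(B) - conf(B)|."""
    bins = [[] for _ in range(n_bins)]
    for conf, y in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Toy panels: vote share used as confidence, 1 = plurality answer correct.
shares = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
labels = [1, 1, 0, 0, 1, 1]
ece = expected_calibration_error(shares, labels)
print(round(ece, 3))
```

In this toy input the high-agreement panels are over-confident and the low-agreement panels are under-confident, and both kinds of error add to the ECE.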
### B.4 Vote Share Is Not a Reliable Confidence Estimate
Figure 9: Reliability diagrams per (topology, benchmark) cell. Plurality vote share ($x$) vs. empirical accuracy ($y$). Bubble area is the bin’s panel count. Points above the identity line are under-confident (Mode A); below, over-confident (Mode B).

Figure [9](https://arxiv.org/html/2605.30653#A2.F9) shows that plurality vote share has different calibration behavior across benchmarks and communication topologies. Each point compares a vote-share bin with the empirical accuracy of panel predictions in that bin; points on the diagonal indicate perfect calibration. The patterns are highly cell-dependent: on TriviaQA and GSM8K, many points lie above the diagonal, meaning that vote share underestimates correctness, while on TruthfulQA many points lie below the diagonal, meaning that vote share overestimates correctness. Moreover, the same benchmark can behave differently under different topologies, such as iid, debate, chain, hub-spoke, and tree. This shows that vote share alone cannot determine panel confidence; calibration must account for both task domain and the communication structure that produced the vote.
### B.5 Agent Error Correlations Vary Across Topologies and Benchmarks
Figure 10: Per-(topology, benchmark) agent error correlation $\mathbf{W}$. Each panel is an $N\times N$ Pearson correlation of the binary error indicator $c_{i}$ across panel agents, with per-query mean residualisation to control for task difficulty. Agents are sorted by (backbone, role) so visual patterns are comparable across topologies.

Figure [10](https://arxiv.org/html/2605.30653#A2.F10) shows that agent correctness correlations vary across model families, communication topologies, and benchmarks. Darker entries indicate that two agents tend to succeed or fail together. Within-family blocks, such as agents sharing the same Gemma, Llama, Phi, or Qwen backbone, are often darker, suggesting shared model-side blind spots. Comparing rows shows that topology also changes the dependence pattern: chain and debate often make the matrices darker than iid, indicating stronger communication-induced coupling. Comparing columns shows that this effect is benchmark-dependent: correlations are much stronger on TriviaQA and TruthfulQA than on MMLU-Pro, GSM8K, or BBH. These patterns show that multi-agent reliability depends jointly on heterogeneous LLM composition, communication structure, and task domain, motivating topology-aware calibration.
## Appendix C Method Details
### C.1 Node Feature Construction
Each node represents one agent. For both $G_{x}^{T}$ and $G_{x}^{0}$, we use the same 23-dimensional node feature schema:

$$v_{i}(x) = [s_{i},\, p_{i},\, r_{i},\, \mu_{i},\, \sigma_{i}^{2},\, m_{i},\, n^{-1},\, a_{i}]$$

The confidence score $s_{i}$ is computed from the agent’s mean answer log probability. We clip the mean log probability to $[-10, 0]$ and linearly map it to a normalized score. The plurality indicator $p_{i}$ is one when the agent’s normalized answer matches the panel plurality answer. The vote rank $r_{i}$ is normalized by the number of distinct answer clusters in the panel. The correlation summaries $(\mu_{i}, \sigma_{i}^{2}, m_{i})$ are the row mean, row variance, and row maximum of $|W_{ij}(x)|$ over other agents. For the observed tower, these summaries are computed from the post-communication correlation matrix. For the iid counterfactual tower, they are computed from the iid correlation matrix. Finally, the answer embedding $a_{i}$ is obtained by encoding the agent’s answer with Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30653#bib.bib23)) and reducing the embedding to 16 dimensions by PCA. Thus, each node records both the agent’s individual contribution to the vote and its local dependence pattern with the rest of the panel.
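The seven scalar entries of the node feature vector can be sketched as follows. Several details are assumptions rather than statements from the paper: the normalized score is mapped to $[0, 1]$, the row variance is the population variance, the vote rank is the zero-based rank of the agent's answer cluster divided by the number of clusters, and $n$ is read as the panel size. The 16-dimensional answer embedding is omitted, and all function names are hypothetical.

```python
from statistics import mean, pvariance


def confidence_score(mean_logprob, lo=-10.0, hi=0.0):
    """Clip the mean answer log probability to [lo, hi], then map to [0, 1]."""
    clipped = max(lo, min(hi, mean_logprob))
    return (clipped - lo) / (hi - lo)


def correlation_summaries(W, i):
    """Row mean, variance, and max of |W[i][j]| over the other agents j != i."""
    row = [abs(W[i][j]) for j in range(len(W)) if j != i]
    return mean(row), pvariance(row), max(row)  # population variance (assumed)


def scalar_node_features(i, mean_logprobs, answers, W):
    """Scalar part of v_i(x): [s, p, r, mu, sigma^2, m, 1/n]."""
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    # Answer clusters ordered by vote count; ties broken alphabetically.
    ranked = sorted(counts, key=lambda a: (-counts[a], a))
    s = confidence_score(mean_logprobs[i])
    p = 1.0 if answers[i] == ranked[0] else 0.0
    r = ranked.index(answers[i]) / len(ranked)  # assumed rank normalization
    mu, var, mx = correlation_summaries(W, i)
    return [s, p, r, mu, var, mx, 1.0 / len(answers)]  # n = panel size (assumed)


answers = ["A", "A", "B"]
logprobs = [-1.0, -12.0, -5.0]
W = [[1.0, 0.4, -0.2], [0.4, 1.0, 0.0], [-0.2, 0.0, 1.0]]
features = scalar_node_features(0, logprobs, answers, W)
print(features)
```

Passing the post-communication matrix or the iid matrix as `W` yields the features for the observed tower or the counterfactual tower, respectively.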
### C.2 Instance-Conditional Correlation Estimation
We estimate the correlation from the training split only. For each query, we retrieve the $k=20$ nearest training queries in Sentence-BERT embedding space and compute the empirical Pearson correlation of agent correctness over this local neighborhood. Pairs with nearly zero variance are assigned correlation zero. Validation and test labels are never used to construct these matrices.
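The estimation step can be illustrated with a small standard-library sketch. It assumes cosine similarity as the nearness measure and a fixed variance tolerance `eps`; neither is specified in the text, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys, eps=1e-8):
    """Pearson correlation; returns 0.0 when either variance is near zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx < eps or vy < eps:
        return 0.0
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(vx * vy)


def local_correlation(query_emb, train_embs, train_correct, k=20):
    """W(x): agent-by-agent correlation over the k nearest training queries.

    train_correct[q][i] is the 0/1 correctness of agent i on training query q.
    """
    order = sorted(range(len(train_embs)),
                   key=lambda q: -cosine(query_emb, train_embs[q]))
    neighbours = order[:k]
    n_agents = len(train_correct[0])
    cols = [[train_correct[q][i] for q in neighbours] for i in range(n_agents)]
    return [[1.0 if i == j else pearson(cols[i], cols[j])
             for j in range(n_agents)] for i in range(n_agents)]


# Toy training set: four queries near the test query, two far away.
train_embs = [[1, 0], [1, 0.1], [1, -0.1], [0.9, 0.05], [0, 1], [-1, 0]]
train_correct = [[1, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
W = local_correlation([1, 0], train_embs, train_correct, k=4)
print(W)
```

Because only training labels enter `train_correct`, the resulting matrix can be built for validation and test queries without touching their labels.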
In implementation, we keep an edge if either the agents are directly connected by communication or the absolute local correlation exceeds $0.05$. For the iid counterfactual graph, only the correlation criterion is used. This keeps the graph focused on meaningful dependence while preserving direct communication links. If no edge remains after thresholding, we use a fully connected graph without self loops to avoid degenerate isolated graphs.
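The edge rule and its fallback can be written compactly. The sketch below treats edges as undirected pairs, which is an assumption (the text does not say whether communication edges are directed), and `build_edges` is a hypothetical name.

```python
def build_edges(W, comm_links=(), threshold=0.05):
    """Undirected edge set over agents 0..n-1.

    Keep (i, j) if the agents communicate directly or |W[i][j]| > threshold.
    Pass comm_links=() for the iid counterfactual graph.
    """
    n = len(W)
    comm = {tuple(sorted(link)) for link in comm_links}
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if (i, j) in comm or abs(W[i][j]) > threshold}
    if not edges:  # fallback: fully connected graph without self loops
        edges = {(i, j) for i in range(n) for j in range(i + 1, n)}
    return edges


W = [[1.0, 0.3, 0.01], [0.3, 1.0, 0.02], [0.01, 0.02, 1.0]]
observed = build_edges(W, comm_links=[(2, 1)])   # post-communication graph
counterfactual = build_edges(W)                  # iid graph: correlation only
zero = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fallback = build_edges(zero)
print(sorted(observed), sorted(counterfactual), sorted(fallback))
```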
## Appendix D Implementation Details
### D.1 Benchmarks
We evaluate on five English\-language benchmarks chosen to span distinct skill profiles: short\-form factual recall \(TriviaQA\), truthfulness in the face of common misconceptions \(TruthfulQA\), broad expert\-level knowledge \(MMLU\-Pro\), grade\-school numerical reasoning \(GSM8K\), and a diverse battery of hard reasoning tasks \(BBH\)\.
##### TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2605.30653#bib.bib44)).
A large-scale closed-book QA dataset of factoid trivia questions sourced from quiz league competitions. Gold answers are short strings, and an official multi-answer alias set is provided to accommodate surface-form variation at grading time.
##### TruthfulQA (Lin et al., [2022b](https://arxiv.org/html/2605.30653#bib.bib45)).
817 questions designed to probe whether models reproduce common human misconceptions or imitative falsehoods. We use the open-ended generation setting, in which each question ships with reference sets of correct and incorrect answers used to grade free responses.
##### MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2605.30653#bib.bib46)).
An enhanced version of MMLU with up to ten answer options per question and stronger distractors than the original, covering 14 subject categories. Questions are presented in standard A–J multiple-choice format and graded by the selected letter.
##### GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.30653#bib.bib47)).
Grade-school math word problems requiring multi-step arithmetic reasoning, each with a single numerical gold answer. Grading compares numerical equality of the model’s parsed answer against the canonical gold value.
##### BBH (Suzgun et al., [2023](https://arxiv.org/html/2605.30653#bib.bib48)).
BIG-Bench Hard, a collection of 23 subtasks drawn from BIG-Bench on which prior LLMs underperformed humans. Subtasks vary in answer format (multiple choice, yes/no, free-form), and grading dispatches to a subtask-specific judge.
### D.2 Agent designs
##### Backbones.
We use four open-weight LLM backbones drawn from four distinct model families: Qwen3-8B (Qwen Alibaba, [2025](https://arxiv.org/html/2605.30653#bib.bib64)), Llama-3.1-8B (MetaAI, [2024](https://arxiv.org/html/2605.30653#bib.bib61)), Gemma-3-12B (Google, [2025](https://arxiv.org/html/2605.30653#bib.bib65)), and Phi-4 (Microsoft, [2024](https://arxiv.org/html/2605.30653#bib.bib66)). Using four independently pretrained families is intended to expose realistic disagreement: agents sometimes fail in correlated ways because they share training data or recipe, and sometimes disagree because they come from different pipelines.
##### Prompting roles.
Each backbone is paired with one of five atomic single-pass reasoning roles, each adapted verbatim from its source paper and summarized in Appendix [G](https://arxiv.org/html/2605.30653#A7): direct (zero-shot, no reasoning shown), cot (Wei et al., [2022](https://arxiv.org/html/2605.30653#bib.bib57)), plan_solve (Wang et al., [2023a](https://arxiv.org/html/2605.30653#bib.bib58)), step_back (Zheng et al., [2024](https://arxiv.org/html/2605.30653#bib.bib59)), and analogical (Yasunaga et al., [2024](https://arxiv.org/html/2605.30653#bib.bib60)). Restricting each agent to a single LLM call and a single role keeps the role axis and the topology axis orthogonal, so any multi-round or multi-agent behavior comes from the topology rather than from a role internally hiding a sub-population.
##### Canonical population.
An *agent* is a (backbone, prompting role) pair. The full $4\times 5$ cross product yields 20 deterministically ordered atomic agents that constitute our canonical population, and the *same* population is reused across every topology so that any cross-topology difference can be attributed to communication structure rather than to a different mix of agents.
### D.3 Communication Topology Design
We compare five communication topologies that span the spectrum from no communication to fully connected, and from peer-to-peer to hierarchical aggregation. All topologies share the same agent population, the same per-agent sampling temperature, the same maximum decode length, and the same plurality-vote aggregator at the end. Each topology is run with 3 rollouts per query for reproducibility.
##### iid (no communication).
The $N$ agents answer the query independently and in parallel. This is the standard self-consistency setting and serves as the counterfactual baseline against which the four communicating topologies are compared. The iid rollout is also the source of the counterfactual graph $G_{x}^{0}$ used by CAGE-Cal.
##### debate (two-round full mesh).
Round 1 is identical to iid. In round 2, every agent receives a formatted block listing all other agents’ round-1 answers and is invited to revise. The final panel is the set of round-2 answers, so every agent has been exposed to every peer exactly once before committing.
##### chain (sequential pipeline).
The $N$ agents are arranged in a per-rollout shuffled order $a_{1}\to a_{2}\to\cdots\to a_{N}$. Agent $a_{i}$ sees the formatted answers of $a_{1},\ldots,a_{i-1}$ before producing its own. The final panel is the set of all $N$ answers; the last agent has the largest peer context, the first agent has none.
##### hub-spoke (centralized aggregation).
The $N-1$ spoke agents answer the query independently and in parallel. The hub then sees a formatted block of all spoke answers labelled as “worker” contributions and produces its own answer. The final panel includes the hub and all spokes, with the hub’s answer typically dominating the plurality vote.
##### tree (hierarchical aggregation).
A complete binary tree of depth $L$ aggregates leaf answers upward. At the bottom layer, $2^{L}$ leaf agents answer independently. Each internal node sees the answers of its two children and produces its own answer; the root produces the panel’s most informed answer. All nodes’ answers are retained for the panel. With our canonical $N=15$ subset, the tree has depth $L=3$ and shape $8+4+2+1$.
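The five topologies differ only in which peer answers each agent sees before answering. The sketch below encodes that visibility relation; it is an illustration rather than the paper's implementation. Taking agent 0 as the hub, laying out the tree in heap order (node `i` has children `2i+1` and `2i+2`), and seeding the chain shuffle per rollout are all assumptions.

```python
import random


def visible_peers(topology, n, seed=0):
    """Map each agent index to the list of agents whose answers it sees."""
    agents = list(range(n))
    if topology == "iid":
        return {a: [] for a in agents}
    if topology == "debate":  # round 2: everyone sees all other round-1 answers
        return {a: [b for b in agents if b != a] for a in agents}
    if topology == "chain":  # shuffled order; each agent sees all predecessors
        order = agents[:]
        random.Random(seed).shuffle(order)
        return {a: order[:pos] for pos, a in enumerate(order)}
    if topology == "hub_spoke":  # agent 0 is taken as the hub (an assumption)
        return {a: (agents[1:] if a == 0 else []) for a in agents}
    if topology == "tree":  # heap layout: node i has children 2i+1 and 2i+2
        return {a: [c for c in (2 * a + 1, 2 * a + 2) if c < n] for a in agents}
    raise ValueError(f"unknown topology: {topology}")


tree = visible_peers("tree", 15)
leaves = [a for a, seen in tree.items() if not seen]
chain = visible_peers("chain", 5, seed=1)
print(len(leaves), sorted(len(seen) for seen in chain.values()))
```

With 15 agents the heap layout gives 8 leaves and 7 internal nodes, matching the $8+4+2+1$ shape described above.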
### D.4 Baselines
We compare CAGE-Cal against four families of baselines. The first family applies classical post-hoc calibration to the raw plurality vote share. The second asks an LLM judge to score the panel directly. The third trains a learned calibration head on validation data. The fourth reports ranking-only uncertainty scores that do not target calibrated probabilities.
#### D.4.1 Post-hoc plurality calibrators
Methods in this family use the plurality vote share as their raw confidence signal (Kuncheva, [2004](https://arxiv.org/html/2605.30653#bib.bib20)). This is the fraction of agents that support the panel’s most-voted answer. The four methods differ only in how this scalar is reshaped into a final probability. *Plurality share* uses the raw value with no learnable parameters and serves as the simplest calibration reference. *+ Platt scaling* (Platt and others, [1999](https://arxiv.org/html/2605.30653#bib.bib12)) fits a parametric logistic mapping on the validation split. *+ Isotonic regression* (Zadrozny and Elkan, [2002](https://arxiv.org/html/2605.30653#bib.bib17)) instead fits a non-parametric monotone mapping. It can correct non-sigmoidal miscalibration but introduces more variance. *+ Scaling-binning* (Kumar et al., [2019](https://arxiv.org/html/2605.30653#bib.bib15)) chains a parametric scaler with empirical bin-mean replacement and provides finite-sample ECE guarantees.
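To make the first two calibrators concrete, the sketch below computes a plurality share and fits a logistic mapping on it by plain gradient descent. This is a simplified stand-in for Platt scaling (it omits Platt's target smoothing and uses a basic optimizer), and the learning rate, step count, and toy validation data are illustrative assumptions.

```python
import math
from collections import Counter


def plurality_share(answers):
    """Fraction of agents that support the most-voted answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def fit_platt(shares, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * share + b) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(shares)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(shares, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(share, a, b):
    return 1.0 / (1.0 + math.exp(-(a * share + b)))


# Toy validation panels: vote share and whether the plurality answer was right.
val_shares = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.45, 0.95, 0.65]
val_labels = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
a, b = fit_platt(val_shares, val_labels)
print(plurality_share(["A", "A", "B", "C"]), platt_predict(0.9, a, b))
```

Because the mapping is a monotone function of a single scalar, it can rescale vote share but cannot reorder panels, so ranking metrics such as AUROC are unchanged by this kind of calibration.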
#### D.4.2 LLM-elicited confidence estimators
These baselines test whether panel correctness can be inferred from the final agent responses alone. We prompt an LLM judge to return a probability that the plurality answer is correct. The judge sees the question, the agent answers, and the plurality answer, but no graph or per-token statistics. Full prompts appear in Appendix [G](https://arxiv.org/html/2605.30653#A7). *LLM-Cal (no topology)* queries the judge directly. *LLM-Cal (+topology)* additionally injects a one-line description of the panel’s communication topology into the prompt. This isolates whether the judge can exploit topology information when given it explicitly. *Collaborative Calibration* (Yang et al., [2024](https://arxiv.org/html/2605.30653#bib.bib13)) prompts the judge to silently simulate a multi-expert deliberation before emitting a single consensus probability.
#### D.4.3 Trained calibrators
These baselines train a calibration head on validation panels and target plurality-answer correctness. To make the comparison fair, all three trained baselines receive the same per-benchmark post-hoc protocol as our method. *Scalar + GBT* (Ke et al., [2017](https://arxiv.org/html/2605.30653#bib.bib19)) is a strong feature-based baseline. It trains a gradient-boosted decision tree on hand-crafted scalar panel statistics, with no relational encoding. *GraphCal* (Li et al., [2025](https://arxiv.org/html/2605.30653#bib.bib10)) adapts a graph-based calibrator to the panel setting. It encodes the observed agent graph with a GCN (Kipf and Welling, [2017](https://arxiv.org/html/2605.30653#bib.bib67)), but it omits the counterfactual iid view, the hyperedge stream, and the failure-correlation edges used by CAGE-Cal. *DiscoUQ-LLM* (Jiang, [2026](https://arxiv.org/html/2605.30653#bib.bib51)) is a disagreement-feature baseline. It trains a head on a small set of panel-level disagreement summaries.
| Method | TriviaQA AUROC ↑ | TriviaQA AUARC ↑ | TruthfulQA AUROC ↑ | TruthfulQA AUARC ↑ | MMLU-Pro AUROC ↑ | MMLU-Pro AUARC ↑ | GSM8K AUROC ↑ | GSM8K AUARC ↑ | BBH AUROC ↑ | BBH AUARC ↑ | Mean AUROC ↑ | Mean AUARC ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Plurality vote | 81.89 ± 0.79 | 93.09 ± 1.14 | 71.16 ± 3.89 | 31.30 ± 1.22 | 61.78 ± 0.80 | 53.02 ± 0.40 | 84.01 ± 2.22 | 97.87 ± 0.22 | 66.15 ± 1.49 | 84.22 ± 0.46 | 72.99 ± 1.52 | 71.90 ± 0.07 |
| Answer entropy | 80.19 ± 0.67 | 92.82 ± 1.11 | 70.83 ± 2.97 | 31.36 ± 0.93 | 57.92 ± 0.98 | 51.01 ± 0.70 | 82.60 ± 0.99 | 97.72 ± 0.08 | 67.98 ± 1.30 | 84.94 ± 0.52 | 71.90 ± 1.11 | 71.57 ± 0.06 |
| Avg-logprob | 69.83 ± 2.81 | 86.96 ± 0.74 | 60.31 ± 0.41 | 27.65 ± 1.01 | 43.65 ± 2.08 | 27.33 ± 0.40 | 57.87 ± 2.32 | 75.41 ± 0.48 | 46.11 ± 1.90 | 49.77 ± 1.47 | 55.56 ± 0.91 | 53.42 ± 0.82 |
| DAE | 80.19 ± 0.66 | 92.82 ± 1.11 | 70.85 ± 3.01 | 31.34 ± 0.95 | 57.91 ± 0.94 | 51.01 ± 0.72 | 82.61 ± 1.00 | 97.72 ± 0.08 | 67.96 ± 1.31 | 84.94 ± 0.53 | 71.91 ± 1.12 | 71.57 ± 0.05 |
| MATU | 71.23 ± 2.55 | 89.73 ± 0.33 | 59.80 ± 2.57 | 26.88 ± 1.82 | 61.39 ± 0.96 | 52.20 ± 0.24 | 56.68 ± 1.32 | 95.45 ± 0.19 | 48.81 ± 2.91 | 73.89 ± 0.19 | 59.58 ± 1.68 | 67.63 ± 0.42 |
| CAGE-Cal (ours) | 86.12 ± 0.78 | 93.02 ± 0.26 | 79.66 ± 2.43 | 38.76 ± 1.44 | 77.74 ± 0.75 | 61.54 ± 0.88 | 84.04 ± 3.55 | 97.51 ± 0.51 | 90.48 ± 0.81 | 91.52 ± 0.74 | 83.61 ± 1.34 | 76.47 ± 0.37 |

Table 7: Heuristic UQ baselines and CAGE-Cal: per-benchmark AUROC and AUARC. Mean ± std over 3 rollouts, averaged across the 5 topologies. ECE not applicable to the five ranking-only methods; CAGE-Cal’s ECE is in Table [6](https://arxiv.org/html/2605.30653#S6). CAGE-Cal AUROC matches Table [6](https://arxiv.org/html/2605.30653#S6).
#### D.4.4 Ranking-only UQ scores
This family does not produce calibrated probabilities. It only yields scalar uncertainty estimates that can rank panels by reliability, so we report only AUROC and AUARC for these methods. *Answer entropy* (Kuhn et al., [2023](https://arxiv.org/html/2605.30653#bib.bib28)) is the Shannon entropy of the panel’s distribution over distinct agent answers. *Average log probability* (Kadavath et al., [2022](https://arxiv.org/html/2605.30653#bib.bib25)) averages each agent’s mean per-token log-probability across the panel. It uses only signals that the decoder already provides. *DiverseAgentEntropy* (Feng et al., [2025](https://arxiv.org/html/2605.30653#bib.bib26)) extends answer entropy by first diversifying the agents that produce the answer distribution. *MATU* (Chen et al., [2026](https://arxiv.org/html/2605.30653#bib.bib27)) arranges the panel’s per-agent answer distribution into a low-rank tensor and uses the PARAFAC2 reconstruction residual as the uncertainty score.
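The two simplest scores in this family are easy to state in code. The sketch below assumes answers have already been normalized into comparable strings (one string per answer cluster) and uses natural-log entropy; both are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (nats) of the panel's empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def average_log_probability(agent_mean_logprobs):
    """Panel score: average of each agent's mean per-token log-probability."""
    return sum(agent_mean_logprobs) / len(agent_mean_logprobs)


unanimous = answer_entropy(["A", "A", "A", "A"])
split = answer_entropy(["A", "A", "B", "B"])
avg_lp = average_log_probability([-0.5, -1.5, -1.0])
print(unanimous, split, avg_lp)
```

Higher entropy indicates more disagreement and is read as higher uncertainty, whereas a higher average log probability is read as higher confidence, so the two scores rank panels in opposite directions.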
### D.5 Training Details
##### Data splits.
For each of the five benchmarks we sample 500 problems with a fixed seed and deterministically split them $60{:}20{:}20$ into train, validation, and test partitions. The same split is reused across every topology, every method, and every seed, so a question is never seen in training under one topology and held out under another. Because each problem is rolled out through every topology with three independent rollouts, the resulting panel pool contains roughly 37,500 panels in total. The training split fits the encoder and the calibration head, the validation split fits the post-hoc calibrators and selects model checkpoints, and the test split is used only for final reporting. For the leave-one-topology-out experiments the same per-benchmark splits are kept fixed and only the topology used during training is varied.
##### Optimization\.
We train the encoder for 15 epochs with AdamW, an initial learning rate of $2\times 10^{-3}$, weight decay of $3\times 10^{-4}$, and a cosine learning-rate schedule. Each batch contains 128 panels and gradients are clipped to a global norm of 1.0. The training loss combines binary cross-entropy on plurality-answer correctness with a Brier auxiliary term weighted by 0.4, and the target is label-smoothed with $\alpha=0.05$ to prevent the model from collapsing to extreme probabilities on high-agreement panels. Model selection on the validation split uses the composite score $\text{AUROC}-0.5\cdot\text{ECE}$, with an early-stopping patience of 5 epochs. We train an ensemble of 10 random seeds and average the predicted probabilities. We then apply per-benchmark Beta calibration and scaling-binning as post-hoc steps and blend the two outputs with equal weight.
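The loss and the model-selection score can be sketched per panel as below. Two details are assumptions not stated in the text: label smoothing is taken as $t = y(1-\alpha) + \alpha/2$, and the Brier term is computed against the smoothed target. The function names are hypothetical, and the sketch is framework-free rather than the training code itself.

```python
import math


def smooth_label(y, alpha=0.05):
    """Label smoothing: pull the 0/1 target toward 0.5 by alpha (assumed form)."""
    return y * (1.0 - alpha) + 0.5 * alpha


def panel_loss(p, y, brier_weight=0.4, alpha=0.05, eps=1e-7):
    """BCE on the smoothed target plus a weighted Brier term, for one panel."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    t = smooth_label(y, alpha)
    bce = -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    brier = (p - t) ** 2
    return bce + brier_weight * brier


def selection_score(auroc, ece):
    """Composite validation score AUROC - 0.5 * ECE used to pick a checkpoint."""
    return auroc - 0.5 * ece


at_target = panel_loss(0.975, 1)
too_extreme = panel_loss(0.9999, 1)
print(at_target, too_extreme, selection_score(0.85, 0.10))
```

With a smoothed target, both terms are minimized at the target itself, so a prediction of 0.9999 on a correct panel is penalized more than 0.975, which is the mechanism that discourages extreme probabilities.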
##### Computational cost\.
The bulk of the wall-clock cost in our pipeline lies in generating the multi-agent panels with vLLM-served open-weight LLMs, which we cache and reuse across all methods. Training CAGE-Cal on top of the cached panels is comparatively cheap: a single 15-epoch training run completes in roughly 5–10 minutes on a single NVIDIA A100, and the full 10-seed ensemble, including the per-benchmark post-hoc fits, completes in well under two hours on the same hardware.
## Appendix E Additional Calibration and Uncertainty Results
### E.1 Ranking-Only UQ Comparison
Table [7](https://arxiv.org/html/2605.30653#A4.T7) compares CAGE-Cal with heuristic uncertainty baselines across benchmarks. Plurality vote and answer entropy rely only on the final answer distribution, while log-probability and other uncertainty scores do not explicitly model agent dependencies. CAGE-Cal achieves the best mean AUROC and AUARC, showing that topology-aware dependency features provide a stronger signal for estimating whether the panel answer is correct.
### E.2 Brier Score Comparison

| Method | TriviaQA | TruthfulQA | MMLU-Pro | GSM8K | BBH | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| *Post-hoc plurality calibrators* | | | | | | |
| Plurality share | 12.12 ± 0.23 | 24.52 ± 0.90 | 25.01 ± 0.18 | 9.61 ± 0.21 | 23.89 ± 0.11 | 19.03 ± 0.19 |
| + Platt | 10.91 ± 0.06 | 16.36 ± 0.76 | 24.30 ± 0.19 | 4.37 ± 0.18 | 17.97 ± 0.02 | 14.78 ± 0.21 |
| + Isotonic | 10.89 ± 0.18 | 16.45 ± 0.93 | 24.76 ± 0.42 | 4.50 ± 0.27 | 17.73 ± 0.29 | 14.87 ± 0.35 |
| + Scaling-bin. | 12.29 ± 0.09 | 16.06 ± 0.30 | 24.21 ± 0.02 | 4.54 ± 0.01 | 17.98 ± 0.11 | 15.02 ± 0.03 |
| *LLM-elicited confidence estimators* | | | | | | |
| LLM-Cal (no topo) | 10.11 ± 0.09 | 46.90 ± 0.34 | 35.46 ± 1.36 | 10.19 ± 0.04 | 19.37 ± 0.69 | 24.41 ± 0.45 |
| LLM-Cal (+topo) | 10.75 ± 0.07 | 46.51 ± 0.24 | 34.60 ± 0.52 | 8.96 ± 0.56 | 19.40 ± 0.09 | 24.04 ± 0.04 |
| Collab. Cal. | 10.32 ± 0.32 | 44.21 ± 0.63 | 33.71 ± 0.97 | 8.41 ± 0.74 | 18.26 ± 0.33 | 22.98 ± 0.17 |
| *Trained calibrators* | | | | | | |
| Scalar + GBT | 11.43 ± 0.62 | 17.49 ± 1.05 | 25.96 ± 0.75 | 4.87 ± 0.05 | 18.11 ± 0.05 | 15.57 ± 0.22 |
| GraphCal | 12.29 ± 0.35 | 28.32 ± 0.77 | 26.28 ± 0.08 | 9.25 ± 0.03 | 20.18 ± 0.02 | 19.26 ± 0.10 |
| DiscoUQ-LLM | 10.47 ± 0.19 | 15.31 ± 0.61 | 24.02 ± 0.03 | 4.46 ± 0.09 | 17.57 ± 0.14 | 14.37 ± 0.16 |
| CAGE-Cal (ours) | 9.55 ± 0.80 | 13.57 ± 0.53 | 19.26 ± 0.46 | 4.33 ± 0.22 | 9.26 ± 0.48 | 11.19 ± 0.50 |

Table 8: Brier score (↓) per benchmark. Mean ± std over 3 rollouts, averaged across the 5 topologies. Reported only for methods that produce a probability natively. CAGE-Cal attains the lowest Brier on every benchmark and the lowest Mean.

Table [8](https://arxiv.org/html/2605.30653#A5.SS2) compares probability calibration using the Brier score, where lower values indicate better-calibrated confidence. The baselines cover three categories: post-hoc plurality calibrators, LLM-elicited confidence estimators, and trained calibrators. CAGE-Cal achieves the lowest Brier score on every benchmark and the best mean score overall, lowering the mean Brier score from 14.37 for the strongest baseline, DiscoUQ-LLM, to 11.19. This suggests that explicitly modeling agent dependencies and communication topology provides a stronger calibration signal than calibrating vote share, eliciting verbalized confidence, or using hand-crafted disagreement features alone.
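For reference, the Brier score is the mean squared difference between the predicted probability and the 0/1 correctness of the plurality answer. The sketch below computes it on the unit scale; the values in the table above appear to be reported on a 0 to 100 scale, which is an inference from their magnitude rather than something the text states.

```python
def brier_score(probs, correct):
    """Mean squared error between predicted probability and 0/1 correctness."""
    return sum((p - y) ** 2 for p, y in zip(probs, correct)) / len(probs)


# Toy panels: calibrated confidence vs. an uninformative constant 0.5.
labels = [1, 1, 0, 0]
informed = brier_score([0.9, 0.8, 0.3, 0.6], labels)
constant = brier_score([0.5, 0.5, 0.5, 0.5], labels)
print(informed, constant)
```

A constant prediction of 0.5 always scores 0.25, which gives a useful reference point when reading per-benchmark values.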
StrategyTriviaQATruthfulQAMMLU\-ProGSM8KBBHMean\\rowcolorgray\!20Fixed topologyiid83\.6720\.0041\.5094\.6772\.5162\.41debate84\.0123\.6746\.9494\.6775\.9564\.98chain82\.6520\.6746\.2694\.0075\.9563\.83hub\-spoke84\.0120\.3342\.8695\.3372\.8563\.02tree82\.9919\.6741\.5095\.0070\.1061\.80Per\-bench best fixed84\.0123\.6746\.9495\.3375\.9565\.18\\rowcolorgray\!20Selection w/o learned confidenceMajority over topologies84\.6919\.0040\.8295\.3375\.2662\.95Highest plurality share84\.3522\.3344\.5695\.3378\.6964\.98Highest mean log\-prob84\.6922\.3340\.8294\.6774\.2363\.29\\rowcolorgray\!20Confidence\-routed\\rowcolor\[RGB\]222,230,241CAGE\-Select\(ours\)84\.6924\.0050\.3495\.3381\.7967\.23Oracle topology88\.1031\.6765\.3196\.3386\.2573\.43
Table 9: Confidence-routed topology selection. Mean accuracy on the 1,479 matched-$N$ test groups (one per query, five candidate panels). *Per-bench best fixed* picks each benchmark’s best topology on validation.

## Appendix F Selection, Generalization, and Robustness
### F.1 CAGE-Select: Confidence-Routed Topology Selection
Table [E.2](https://arxiv.org/html/2605.30653#A5.SS2) evaluates whether the calibrated confidence produced by CAGE-Cal can be used to select, for each query, the communication topology whose panel answer is most likely to be correct. We refer to this routing procedure as CAGE-Select.
##### Why per\-query routing\.
The main results show that no single communication topology is best on every query. Some questions are answered most reliably under iid, where independent agents avoid the herding that arises when they communicate. Others benefit from debate or hub-spoke, where peer exchange resolves ambiguity that a lone agent would miss. A *fixed*-topology system is therefore always sub-optimal in expectation: it commits, at design time, to a single communication structure regardless of which structure is appropriate for the specific input. CAGE-Select replaces that design-time commitment with a run-time choice driven by panel-level confidence.
##### Routing procedure\.
For each query $x$, CAGE-Select runs the panel under all candidate topologies, producing a plurality answer $\hat{y}(x,T)$ and a calibrated correctness probability $\hat{p}(x,T)$ from CAGE-Cal for each topology $T$. It then returns the answer of $T^{\star}(x)=\arg\max_{T}\hat{p}(x,T)$. Because $\hat{p}$ is trained as a panel-level correctness probability with the same target across topologies, the values are directly comparable and the routing reduces to a single $\arg\max$ over a small candidate pool.

We compare CAGE-Select against each fixed topology, an oracle that picks a topology carrying the correct answer whenever one exists, and heuristic routers that select by plurality share or mean per-token log-probability. In Table 9, CAGE-Select matches or beats every fixed topology and every heuristic rule on each benchmark and attains the highest mean accuracy (67.23, versus 65.18 for the per-benchmark best fixed topology and 64.98 for the best heuristic). Relative to the per-benchmark best fixed topology, it closes about a quarter of the gap to the oracle (73.43). The heuristics fail in characteristic ways: plurality share is fooled by communication-induced herding that inflates agreement without improving accuracy, and mean log-probability rewards confident generation regardless of whether the panel agrees. A calibrated panel-level confidence is needed to navigate both failure modes, and the same head that flags unreliable panels can therefore double as an inference-time selector that turns topology choice into a per-query decision.
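The routing rule described above can be sketched in a few lines. This is an illustrative sketch only: the `PanelOutput` container, the `select_topology` helper, and the example confidences are hypothetical names and values, and the calibrated probabilities are assumed to have been computed already by a calibrator such as CAGE-Cal.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PanelOutput:
    answer: str        # plurality answer y_hat(x, T)
    confidence: float  # calibrated correctness probability p_hat(x, T)


def select_topology(panels: Dict[str, PanelOutput]) -> str:
    """Return T*(x) = argmax_T p_hat(x, T) over the candidate topologies."""
    if not panels:
        raise ValueError("need at least one candidate topology")
    # Ties are broken by dictionary insertion order (an arbitrary choice here).
    return max(panels, key=lambda t: panels[t].confidence)


# Hypothetical outputs for one query under the five topologies.
panels = {
    "iid": PanelOutput("Paris", 0.62),
    "debate": PanelOutput("Paris", 0.81),
    "chain": PanelOutput("Lyon", 0.40),
    "hub_spoke": PanelOutput("Paris", 0.77),
    "tree": PanelOutput("Lyon", 0.35),
}

best = select_topology(panels)
print(best, panels[best].answer)
```

Because the selector only compares scalars, its cost is negligible next to running the five panels; the comparison is meaningful only if the confidences share the same target across topologies, as the text requires.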
### F.2 Panel-Size Robustness
| Held-out topology | GraphCal AUROC | GraphCal AUARC | DiscoUQ-LLM AUROC | DiscoUQ-LLM AUARC | CAGE-Cal (ours) AUROC | CAGE-Cal (ours) AUARC |
| --- | --- | --- | --- | --- | --- | --- |
| iid | 67.67 | 69.63 | 74.89 | 71.45 | 84.15 | 75.23 |
| debate | 69.95 | 71.59 | 70.32 | 71.67 | 80.49 | 76.60 |
| chain | 67.98 | 70.11 | 66.97 | 70.11 | 82.55 | 77.32 |
| hub-spoke | 70.80 | 71.87 | 73.51 | 72.44 | 80.12 | 74.71 |
| tree | 73.58 | 70.63 | 75.66 | 70.72 | 83.74 | 74.92 |
| Mean | 70.00 | 70.77 | 72.27 | 71.28 | 82.21 | 75.76 |
Table 10: LOTO per-held-out breakdown (percent, AUROC and AUARC). Each row trains on the other four topologies and tests on the held-out fifth. The Mean row reproduces Table [2](https://arxiv.org/html/2605.30653#S7.T2)’s Mean column.

Table [10](https://arxiv.org/html/2605.30653#A6.T10) evaluates leave-one-topology-out (LOTO) generalization, where each row trains the calibrator on four communication topologies and tests it on the held-out topology. CAGE-Cal consistently outperforms GraphCal and DiscoUQ-LLM in both AUROC and AUARC across all held-out topologies, showing that its graph-based dependency modeling generalizes beyond the topologies observed during training. This suggests that CAGE-Cal learns reusable structural signals of agent dependence rather than merely fitting topology-specific patterns.
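The leave-one-topology-out protocol and the AUROC metric can be illustrated as follows. This is a hedged sketch rather than the authors' pipeline: `loto_splits` and `auroc` are hypothetical helper names, the pairwise AUROC is the standard rank-based definition (probability that a correct panel receives a higher score than an incorrect one), and the toy scores are not values from the paper.

```python
from typing import Iterator, List, Sequence, Tuple

TOPOLOGIES = ["iid", "debate", "chain", "hub_spoke", "tree"]


def loto_splits(topologies: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training topologies, held-out topology) for each topology."""
    for held_out in topologies:
        train = [t for t in topologies if t != held_out]
        yield train, held_out


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a correct panel outranks an incorrect one.

    Ties count as half a win. O(n^2), which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


for train, held_out in loto_splits(TOPOLOGIES):
    print(f"train on {train}, test on {held_out}")

# Toy held-out confidences and correctness labels (hypothetical).
print(auroc([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0]))
```

In a full evaluation, a calibrator would be fit on the panels from the four training topologies in each split and scored on the held-out one; the fitting step is omitted here.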
### F.3 Main Structural Findings Are Robust to Panel Size
| Quantity | $N=10$ | $N=20$ | $\Delta$ |
| --- | --- | --- | --- |
| *Accuracy anchors (%)* | | | |
| Single-agent baseline | 54.13 | 54.72 | +0.59 |
| iid plurality | 60.25 | 64.34 | +4.09 |
| Best topology per cell | 63.00 | 66.40 | +3.41 |
| Any-agent-correct oracle | 84.68 | 88.08 | +3.39 |
| Ensembling gain (iid − single) | +6.12 | +9.62 | +3.50 |
| Communication gain (best − iid) | +2.75 | +2.06 | -0.69 |
| Aggregation gap (oracle − best) | +21.69 | +21.67 | -0.01 |
| *OLS $\beta$ on $\mathbf{W}_{ij}$ (pp)* | | | |
| Same backbone family | +10.17 | +10.36 | +0.19 |
| Same prompting role | +1.02 | +0.89 | -0.14 |
| is\_chain | +18.38 | +17.87 | -0.51 |
| is\_debate | +12.19 | +11.76 | -0.43 |
| is\_tree | +6.20 | +6.76 | +0.55 |
| is\_hub\_spoke | +1.46 | +1.15 | -0.31 |
| *Peak regression rate (%)* | | | |
| chain / TruthfulQA | 32.45 | 31.73 | -0.72 |
| tree / TruthfulQA | 31.18 | 30.52 | -0.66 |
Table 11: $N$-scaling robustness. Headline quantities under matched-$N=10$ vs. natural-$N=20$. Structural findings move by less than $\pm 1$ pp.

Table [11](https://arxiv.org/html/2605.30653#A6.T11) evaluates whether our structural findings are robust to the number of agents in the panel by comparing matched $N=10$ panels with the natural $N=20$ setting. Increasing the panel size improves standard accuracy anchors, such as iid plurality accuracy and the any-agent-correct oracle, but the key structural quantities remain stable. In particular, the OLS coefficients on pairwise agent dependence, including same backbone family and topology-induced effects such as chain and debate, change by less than about one percentage point. The peak regression rates on TruthfulQA are also nearly unchanged. These results suggest that the observed population-side dependence and topology-induced coupling are not artifacts of a particular panel size.
## Appendix G Prompt Design
##### Agent role prompts
Each agent in a panel is parameterized by one of five atomic reasoning roles whose prompts are taken from the originating papers; a single shared format guard is appended so that downstream parsing of the "Answer: $\langle X\rangle$" terminator is uniform across roles and benchmarks. A role is restricted to one LLM call and may not internally simulate multiple personas, which keeps the role axis orthogonal to the topology axis. Agent prompting roles used to generate panels are shown in Figure [11](https://arxiv.org/html/2605.30653#A7.F11).
Agent prompting roles

Shared format guard (appended to every role below):

Output the final answer at the end as exactly one line: "Answer: \<your short answer\>"

Role 1: direct (zero-shot baseline)

Answer the question directly. Do not show reasoning.

Role 2: cot (Wei et al., [2022](https://arxiv.org/html/2605.30653#bib.bib57))

Let’s think step by step. Reason through the problem, then commit to a final answer.

Role 3: plan\_solve (Wang et al., [2023a](https://arxiv.org/html/2605.30653#bib.bib58))

First, understand the problem and devise a brief plan in 2--4 steps. Then carry out the plan to solve the problem.

Role 4: step\_back (Zheng et al., [2024](https://arxiv.org/html/2605.30653#bib.bib59))

Take a step back. State the high-level concept, principle, or category that this problem falls under. Then use that principle to solve the specific problem.

Role 5: analogical (Yasunaga et al., [2024](https://arxiv.org/html/2605.30653#bib.bib60))

Recall 2--3 analogous problems you have seen before. Briefly describe each in one sentence. Then use what you learned from those analogies to solve this problem.

Figure 11: Agent prompting roles used to generate panels. Five atomic reasoning roles parameterize each agent, adapted verbatim from the cited source papers. A role must not internally simulate multiple personas or multiple LLM calls; this keeps role and topology orthogonal.
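To make the role of the shared format guard concrete, the sketch below shows one way an agent prompt could be assembled and how the "Answer:" terminator could be parsed and aggregated into a plurality answer. All helper names (`build_agent_prompt`, `parse_answer`, `plurality`) are hypothetical, and the ordering of role text, guard, and question is an assumption; the source only states that the guard is appended to every role. No answer normalization (casing, whitespace, numeric formats) is attempted here.

```python
import re
from collections import Counter
from typing import List, Optional

FORMAT_GUARD = (
    'Output the final answer at the end as exactly one line: '
    '"Answer: <your short answer>"'
)

# Matches a line of the form "Answer: X" and captures X without padding.
ANSWER_RE = re.compile(r"^Answer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def build_agent_prompt(role_prompt: str, question: str) -> str:
    """Concatenate role instructions, the shared guard, and the question.

    Assumption: the question follows the guard; the paper does not specify this.
    """
    return f"{role_prompt}\n{FORMAT_GUARD}\n\nQuestion: {question}"


def parse_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def plurality(answers: List[str]) -> str:
    """Most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


prompt = build_agent_prompt(
    "Answer the question directly. Do not show reasoning.", "What is 6*7?"
)
reply = "Let's think step by step. 6 * 7 = 42.\nAnswer: 42"
print(parse_answer(reply))
print(plurality(["42", "41", "42"]))
```

Taking the last matching line guards against roles such as cot that may mention "Answer:" mid-reasoning before committing to a final line.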
##### LLM\-elicited baseline prompts
For completeness we report the prompt used by the LLM-Cal baseline (Section [6.2](https://arxiv.org/html/2605.30653#S6.SS2)), the zero-shot LLM-elicited calibrator that asks a frozen LLM to map (question, panel answers, plurality) to a correctness probability. The optional +topo variant additionally injects a one-line description of the panel’s communication topology; the exact label mapping is shown in the same box. The prompt used for LLM-Cal is shown in Figure [12](https://arxiv.org/html/2605.30653#A7.F12).
LLM-Cal (zero-shot LLM-elicited calibrator)

System prompt:

You are a calibration assistant. Given a question and N candidate answers from N language model agents, plus the plurality (most-voted) answer, estimate the probability that the plurality answer is correct. Respond with a single decimal number between 0 and 1, no other text.

User prompt template:

Question: {query}

Panel topology: {topology\_description} \# only in the +topo variant

Plurality answer: {plurality}

All agent answers: 1. {answer\_1} 2. {answer\_2} … N. {answer\_N}

Probability the plurality answer is correct (0--1):

Topology descriptions (used by the +topo variant):

iid → "iid (independent)"

debate → "debate (full-mesh cross-critique)"

chain → "chain (sequential)"

hub\_spoke → "hub-spoke (centralized aggregator)"

tree → "tree (hierarchical aggregation by depth)"

Figure 12: LLM-Cal: zero-shot LLM-elicited calibration baseline. The model is asked to map (question, panel answers, plurality) to a single probability that the plurality answer is correct. The optional +topo variant additionally injects a one-line description of the panel’s communication topology into the user prompt. CAGE-Cal itself does not issue any LLM prompts at inference time.
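The template and label mapping in Figure 12 could be filled programmatically along the following lines. This is a sketch under stated assumptions: `build_user_prompt` and `parse_probability` are hypothetical helpers, the line breaks between template fields and the one-answer-per-line layout are guesses (the figure is reproduced here without its original layout), and rejecting out-of-range or non-numeric replies is an illustrative choice, not the paper's documented handling.

```python
from typing import List, Optional

# Label mapping copied from Figure 12.
TOPOLOGY_DESCRIPTIONS = {
    "iid": "iid (independent)",
    "debate": "debate (full-mesh cross-critique)",
    "chain": "chain (sequential)",
    "hub_spoke": "hub-spoke (centralized aggregator)",
    "tree": "tree (hierarchical aggregation by depth)",
}


def build_user_prompt(query: str, answers: List[str], plurality: str,
                      topology: Optional[str] = None) -> str:
    """Fill the user template; pass `topology` to get the +topo variant."""
    lines = [f"Question: {query}"]
    if topology is not None:
        lines.append(f"Panel topology: {TOPOLOGY_DESCRIPTIONS[topology]}")
    lines.append(f"Plurality answer: {plurality}")
    lines.append("All agent answers:")
    # Assumption: one numbered answer per line.
    lines.extend(f"{i}. {a}" for i, a in enumerate(answers, start=1))
    lines.append("Probability the plurality answer is correct (0--1):")
    return "\n".join(lines)


def parse_probability(reply: str) -> Optional[float]:
    """Parse the judge's reply as a probability; None if malformed or out of range."""
    try:
        p = float(reply.strip())
    except ValueError:
        return None
    return p if 0.0 <= p <= 1.0 else None


prompt = build_user_prompt(
    "Capital of France?", ["Paris", "Paris", "Lyon"], "Paris", topology="debate"
)
print(prompt)
print(parse_probability("0.85"))
```

Omitting the `topology` argument yields the no-topology variant, so both LLM-Cal baselines share one code path and differ only in the injected line.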
## Appendix H Use of AI Assistants
AI assistants were used only for language polishing.