MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

arXiv cs.LG Papers

Summary

MARGIN is a runtime confidence calibration method for multi-agent foundation model systems that learns per-agent calibration factors online, improving pairwise resolution from below random to 70-89% on hard benchmarks, requiring no held-out data or retraining.

arXiv:2605.22949v1 Announce Type: new


# Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination
Source: [https://arxiv.org/html/2605.22949](https://arxiv.org/html/2605.22949)
###### Abstract

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent’s response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically miscalibrated and, on hard tasks, *inversely* correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi-Agent Runtime Grading via Incremental Normalisation), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3–6× lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution *worse than random* (45–56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70–89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterise convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.

Keywords: confidence calibration, multi-agent systems, foundation models, online learning, distribution shift.

## 1 Introduction

Foundation models are increasingly deployed as autonomous agents that observe, reason, and act without human intervention \[[30](https://arxiv.org/html/2605.22949#bib.bib30), [32](https://arxiv.org/html/2605.22949#bib.bib27)\]. In multi-agent deployments, a coordinator receives predictions from several agents and must decide which response to trust. The natural approach is to weight each agent by its self-reported confidence. This assumes that confidence is informative: that an agent expressing 90% confidence is more likely correct than one expressing 70%.

The assumption is wrong. Studies consistently show that foundation model confidence is miscalibrated \[[12](https://arxiv.org/html/2605.22949#bib.bib1), [33](https://arxiv.org/html/2605.22949#bib.bib12), [8](https://arxiv.org/html/2605.22949#bib.bib15)\]. An agent claiming 90% confidence may be correct only 60% of the time. More concerning, the miscalibration is not merely imprecise but can be *inverted*: on hard tasks, weaker models express higher confidence than stronger ones, so that trusting the most confident agent systematically selects the wrong answer. In our experiments, raw verbalized confidence produces pairwise resolution of 45–56% on hard code generation benchmarks, worse than a coin flip.

Calibration methods exist. Temperature scaling \[[12](https://arxiv.org/html/2605.22949#bib.bib1)\], Platt scaling \[[26](https://arxiv.org/html/2605.22949#bib.bib2)\], and histogram binning \[[24](https://arxiv.org/html/2605.22949#bib.bib4)\] learn correction functions from held-out validation data. Recent work extends these to language models specifically, through auxiliary calibration models \[[27](https://arxiv.org/html/2605.22949#bib.bib18)\], confidence tuning \[[20](https://arxiv.org/html/2605.22949#bib.bib19)\], and disagreement-aware alignment \[[22](https://arxiv.org/html/2605.22949#bib.bib20)\]. All are design-time methods. They fit a correction once, before deployment, and the correction is then fixed. When the deployment distribution differs from the calibration set, as it inevitably does, the correction degrades. Our experiments show that design-time baselines degrade 3–4× under distribution shift, with ECE rising from single digits to 37–63.

This creates a gap\. Foundation model agents operate in environments where the task distribution shifts continuously: new problem types appear, user behaviour changes, and model updates alter the confidence landscape\. No existing method learns calibration at runtime, from the task stream itself, without access to model internals\.

We present MARGIN (Multi-Agent Runtime Grading via Incremental Normalisation), an online confidence calibration method for multi-agent foundation model systems. MARGIN maintains per-agent, per-confidence-band calibration factors that are updated continuously via symmetric exponentially weighted moving averages (EWMA). The method treats each agent as a black box, observing only its predictions, stated confidence, and eventual outcomes. It requires no held-out calibration data, no access to logits or weights, and no retraining. Bayesian shrinkage blending stabilises estimates during the cold-start period. The entire method has three hyperparameters with robust defaults ($\alpha = 0.04$, $K = 3$ bands, $k_s = 100$) and negligible computational overhead.

We evaluate MARGIN across 19 foundation models \(10 cloud API, 9 local\), 8 benchmarks spanning code generation, question answering, and mathematics, and over 50,000 observations\. The key findings are:

- Distribution shift. Under severe shift, MARGIN achieves 3–6× lower calibration error than the best design-time baseline (ECE 6–11 vs 37–63). Under moderate shift, MARGIN approximately halves the best baseline. The advantage scales monotonically with shift severity.
- Confidence inversion. Raw verbalized confidence is worse than random at pairwise resolution on three of four hard benchmarks (44.8–55.5%). MARGIN corrects this completely, raising pairwise resolution to 70–89%.
- Multi-agent selection. MARGIN-calibrated selection surpasses the always-best-model baseline on three of four benchmarks and reaches 83–97% of oracle performance.
- Symmetric optimality. Symmetric EWMA decisively outperforms all asymmetric configurations. We prove that asymmetric updates introduce systematic bias for non-strategic agents whose confidence errors are epistemic rather than strategic, and observe 3–4× ECE degradation for all tested asymmetric rates, consistent with the theoretical prediction.

Six formal propositions characterise MARGIN’s convergence, tracking speed, bias\-variance tradeoff, and the optimality conditions for symmetric updates\. The theoretical predictions are illustrated by the empirical results throughout\.

The remainder of this paper is organised as follows. Section [2](https://arxiv.org/html/2605.22949#S2) surveys related work. Section [3](https://arxiv.org/html/2605.22949#S3) presents the method. Section [4](https://arxiv.org/html/2605.22949#S4) states formal properties. Sections [5](https://arxiv.org/html/2605.22949#S5)–[8](https://arxiv.org/html/2605.22949#S8) describe the experimental evaluation. Section [10](https://arxiv.org/html/2605.22949#S10) reports ablation studies. Section [11](https://arxiv.org/html/2605.22949#S11) discusses implications and limitations. Section [12](https://arxiv.org/html/2605.22949#S12) concludes.

## 2 Related Work

MARGIN sits at the intersection of calibration, multi-agent coordination, reputation systems, and online learning. We survey each area and identify the gap that MARGIN fills.

### 2.1 Design-Time Calibration

The calibration problem for neural networks was established by Guo et al. \[[12](https://arxiv.org/html/2605.22949#bib.bib1)\], who showed that modern deep networks are poorly calibrated and that a single learned temperature parameter can substantially reduce expected calibration error (ECE) \[[24](https://arxiv.org/html/2605.22949#bib.bib4)\] on held-out data. Platt scaling \[[26](https://arxiv.org/html/2605.22949#bib.bib2)\] fits a logistic regression, and histogram binning \[[24](https://arxiv.org/html/2605.22949#bib.bib4)\] provides a non-parametric alternative. Minderer et al. \[[23](https://arxiv.org/html/2605.22949#bib.bib3)\] revisited these findings for newer architectures and found that calibration properties vary substantially across model families, but that the methods themselves remain design-time: a correction is fitted once and applied without further adaptation.

Recent work has extended design-time calibration to large language models. Shen et al. \[[27](https://arxiv.org/html/2605.22949#bib.bib18)\] propose Thermometer, an auxiliary model trained across multiple tasks to produce calibrated confidence estimates for new tasks. ConfTuner \[[20](https://arxiv.org/html/2605.22949#bib.bib19)\] fine-tunes the language model itself to produce better-calibrated verbalized confidence. DACA \[[22](https://arxiv.org/html/2605.22949#bib.bib20)\] performs post-hoc temperature calibration by aligning a post-trained model’s confidence with a pre-trained reference on agreement examples. These methods represent the current state of the art in LLM calibration.

All design-time methods share a fundamental limitation: they produce a fixed correction that assumes the deployment distribution matches the calibration set. When it does not, the correction degrades. Temperature scaling, for instance, learns a single scalar. If the model is overconfident on one task type and underconfident on another, a single temperature cannot correct both. More critically, if the task distribution shifts after deployment, the correction becomes stale with no mechanism for recovery. Ovadia et al. \[[25](https://arxiv.org/html/2605.22949#bib.bib8)\] document this failure mode systematically across neural uncertainty estimators, showing that every method they evaluate degrades substantially under shift. Ensemble approaches \[[19](https://arxiv.org/html/2605.22949#bib.bib9)\] improve uncertainty estimates but require training or fine-tuning multiple models. MARGIN addresses the limitation differently, by learning calibration factors online from the deployment stream itself without any retraining.

### 2.2 LLM Confidence and Uncertainty Estimation

The reliability of LLM self-reported confidence has been studied extensively. Kadavath et al. \[[15](https://arxiv.org/html/2605.22949#bib.bib13)\] showed that language models have partial self-knowledge about their own uncertainty, but that this self-assessment does not generalise across task distributions. Xiong et al. \[[33](https://arxiv.org/html/2605.22949#bib.bib12)\] conducted a systematic evaluation of confidence elicitation methods across frontier models and found that none produce well-calibrated outputs across tasks. Even GPT-4 achieved an AUROC of only 62.7% for failure prediction, barely above random.

Two comprehensive surveys frame the current landscape. Geng et al. \[[8](https://arxiv.org/html/2605.22949#bib.bib15)\] survey confidence estimation and calibration methods for LLMs, covering verbalized confidence, logit-based methods, ensemble approaches, and post-hoc calibration. Liu et al. \[[21](https://arxiv.org/html/2605.22949#bib.bib14)\] survey uncertainty quantification more broadly, including Bayesian approaches and conformal prediction \[[1](https://arxiv.org/html/2605.22949#bib.bib10)\]. Both surveys document the severity of the miscalibration problem but propose no runtime solution.

Confidence signals for LLMs fall into three categories. Verbalized confidence \[[29](https://arxiv.org/html/2605.22949#bib.bib21)\] prompts the model to state a numerical confidence alongside its prediction. This is broadly available but poorly calibrated, as models tend toward overconfidence and the mapping from internal uncertainty to a stated number is unreliable. Consistency confidence \[[31](https://arxiv.org/html/2605.22949#bib.bib22)\] runs the same query multiple times and measures agreement across samples. A third line of work constructs semantic-level uncertainty measures over generated answers \[[17](https://arxiv.org/html/2605.22949#bib.bib16), [6](https://arxiv.org/html/2605.22949#bib.bib17)\], treating multiple sampled outputs as evidence of underlying uncertainty at the meaning level rather than the token level. MARGIN is agnostic to the confidence source and applies the same online calibration across modalities.

A common thread across all surveyed work is the absence of runtime adaptation\. Confidence estimation methods produce a score; calibration methods correct that score using a pre\-fitted function\. Neither learns from deployment outcomes\. MARGIN occupies this gap: it takes whatever confidence signal is available \(verbalized or consistency\) and learns, at runtime, how much that signal can be trusted for each agent and confidence level\.

### 2.3 Multi-Agent Coordination and Debate

Multi-agent debate, where multiple LLM instances propose and refine answers through iterative discussion, has emerged as a prominent coordination paradigm. Du et al. \[[5](https://arxiv.org/html/2605.22949#bib.bib23)\] showed that multi-agent debate improves factuality and reasoning by exposing models to alternative perspectives. Frameworks such as AutoGen \[[32](https://arxiv.org/html/2605.22949#bib.bib27)\] operationalise this pattern for practical deployment. However, debate assumes that agents can productively critique each other’s reasoning. La Malfa et al. \[[18](https://arxiv.org/html/2605.22949#bib.bib24)\] challenge this assumption, arguing that current LLM multi-agent systems lack core properties of classical MAS such as social interaction and structured environments. Smit et al. \[[28](https://arxiv.org/html/2605.22949#bib.bib25)\] benchmark debate strategies and find that multi-agent debate does not reliably outperform simpler prompting approaches such as self-consistency.

Heterogeneous approaches attempt to impose structure on multi-agent coordination. Zhou and Chen \[[34](https://arxiv.org/html/2605.22949#bib.bib26)\] propose A-HMAD, an adaptive heterogeneous debate framework that assigns distinct roles to different agent types and rates contributions via a consensus optimiser. This imposes architectural structure but does not learn, from observed outcomes, how much to trust any particular agent’s stated confidence.

Confidence-based model selection represents a complementary approach. Gerych et al. \[[9](https://arxiv.org/html/2605.22949#bib.bib28)\] train an auxiliary regression model to predict each LLM’s confidence for a given query and route the query to the most confident model-prompt pair. Chen et al. \[[3](https://arxiv.org/html/2605.22949#bib.bib29)\] propose FrugalGPT, a cost-aware cascade that routes queries to progressively more capable models until a learned confidence threshold is met. Both approaches assume that raw or learned confidence is a reliable signal for routing. Neither tracks prediction outcomes to compute calibration factors, and neither adjusts confidence values based on demonstrated per-agent, per-band reliability.

MARGIN operates at the layer beneath all of these approaches\. Before confidence can be used for debate weighting, hierarchical role assignment, or query routing, it must first be calibrated to reflect actual reliability\. A debate framework that weights agents by raw confidence will systematically amplify the voices of overconfident agents\. A routing system that sends queries to the most confident model will systematically choose the wrong model when confidence is inverted\. MARGIN provides the calibration layer that makes downstream coordination mechanisms reliable\.

### 2.4 Trust and Reputation Systems

Trust and reputation systems have a long history in multi-agent and distributed systems. Jøsang et al. \[[14](https://arxiv.org/html/2605.22949#bib.bib31)\] survey the landscape, covering computational trust models, reputation aggregation, and the distinction between direct experience and third-party recommendations. EigenTrust \[[16](https://arxiv.org/html/2605.22949#bib.bib32)\] computes global reputation scores in peer-to-peer networks by iterating local trust assessments, achieving robust reputation even under adversarial conditions.

These systems track *overall* agent reliability: is this agent generally trustworthy? MARGIN tracks something more specific: *conditional* reliability as a function of stated confidence level. An agent might be highly reliable when it expresses moderate confidence but systematically overconfident at high confidence levels. A single reputation score cannot capture this variation. MARGIN’s per-band calibration factors provide the fine-grained correction that flat reputation cannot.

A second distinction concerns the agent model. Classical reputation systems are designed for strategic agents that may lie, collude, or manipulate their reputation. The update rules are typically asymmetric, penalising bad behaviour faster than rewarding good behaviour, to deter strategic defection. Foundation model agents, by contrast, have fixed policies and do not strategically adjust their confidence in response to external feedback. Their miscalibration is epistemic, not strategic. As we show formally in Proposition [5](https://arxiv.org/html/2605.22949#Thmtheorem5) and empirically in Section [10.3](https://arxiv.org/html/2605.22949#S10.SS3), symmetric updates are optimal for non-strategic agents, whereas the asymmetric rules designed for strategic settings perform catastrophically when applied to them.

### 2.5 Online Learning and Calibration in Non-Stationary Environments

The formal definition of calibration is due to Dawid \[[4](https://arxiv.org/html/2605.22949#bib.bib5)\]: a sequence of probability forecasts is calibrated if, conditional on any stated probability $p$, the observed outcome frequency converges to $p$. Foster and Vohra \[[7](https://arxiv.org/html/2605.22949#bib.bib6)\] subsequently established that asymptotic calibration is achievable by a randomised forecasting rule even against adversarial sequences, a cornerstone result for online calibration. Cesa-Bianchi and Lugosi \[[2](https://arxiv.org/html/2605.22949#bib.bib34)\] develop the general theory of prediction with expert advice and sequential learning that underlies modern online calibration methods.

The exponentially weighted moving average (EWMA) is a classical tool for tracking non-stationary statistics \[[13](https://arxiv.org/html/2605.22949#bib.bib33)\]. Originally developed for statistical process control, the EWMA assigns exponentially decaying weights to past observations, providing a principled tradeoff between tracking speed and estimation noise. The effective memory window of approximately $1/\alpha$ observations makes the method inherently adaptive: stale observations are automatically down-weighted as the environment changes. A closely related line of work in conformal prediction adapts to distribution shift at the prediction-set level: Gibbs and Candès \[[10](https://arxiv.org/html/2605.22949#bib.bib11)\] show that conformal thresholds can be updated online to maintain target coverage under arbitrary shift. MARGIN and adaptive conformal inference are complementary: the former learns point calibration factors suitable for multi-agent selection; the latter produces distribution-free prediction intervals.

MARGIN’s contribution is not the EWMA itself but its application to confidence calibration in a structured way: per\-agent, per\-confidence\-band tracking with Bayesian shrinkage blending and a formal analysis of symmetric versus asymmetric updates\. The per\-band structure enables fine\-grained calibration profiles that a single EWMA per agent cannot provide\. The shrinkage blending addresses the cold\-start problem inherent in stratified tracking, where some \(agent, band\) pairs may accumulate observations slowly\. The symmetric optimality result, which contradicts the standard intuition from strategic reputation systems, is specific to the non\-strategic agent setting and, to our knowledge, has not been formalised previously\.

Table [1](https://arxiv.org/html/2605.22949#S2.T1) summarises the positioning of MARGIN relative to prior work along four dimensions.

Table 1: Positioning of MARGIN relative to prior work.

## 3 Method

### 3.1 Problem Formulation

Consider a pool of $N$ foundation model agents $\mathcal{A} = \{a_1, \ldots, a_N\}$ deployed over a stream of tasks $\{q_1, q_2, \ldots\}$. For each task $q_t$, a subset of agents $\mathcal{A}_t \subseteq \mathcal{A}$ provides responses. Each responding agent $a_i$ produces a prediction $\hat{y}_{i,t}$ together with a confidence score $c_{i,t} \in [0,1]$ representing its self-assessed probability of correctness. After a delay, a binary outcome $o_{i,t} \in \{0,1\}$ is observed, where $o_{i,t} = 1$ if $\hat{y}_{i,t}$ matches the ground truth and $o_{i,t} = 0$ otherwise.

The core problem is that $c_{i,t}$ is unreliable. Foundation models are frequently overconfident, and the degree of miscalibration varies across models, across confidence ranges within the same model, and over time as the task distribution shifts. A model that is well-calibrated on one benchmark may be catastrophically miscalibrated on another. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) fit a correction function to held-out data, but this correction degrades whenever the deployment distribution differs from the calibration set.

We impose three constraints that reflect realistic multi\-agent deployment:

1. No model access. Agents are black boxes. We observe only their predictions, stated confidence, and eventual outcomes. No access to internal logits, weights, or training data is available.
2. No held-out calibration set. The method must learn from the task stream itself. In deployment, the distribution is unknown in advance and may shift at any time.
3. Online operation. Calibration must update incrementally as new observations arrive, without reprocessing historical data.

Under these constraints, the goal is to learn a calibration function $f_i : [0,1] \to [0,1]$ for each agent such that the calibrated confidence $\tilde{c}_{i,t} = f_i(c_{i,t})$ satisfies

$$\mathbb{P}(o_{i,t} = 1 \mid \tilde{c}_{i,t} = p) \approx p \tag{1}$$

for all $p \in [0,1]$, and to use these calibrated confidences for multi-agent selection: choosing the most reliable response from the pool.
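How far a set of confidences departs from the condition in Eq. (1) is commonly summarised by expected calibration error (ECE). The following is a minimal sketch of the standard equal-width binned ECE, not the paper's evaluation code. The function name `expected_calibration_error` and the choice of 10 bins are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    outcomes: Sequence[int],
    num_bins: int = 10,  # assumed bin count; the source does not state one
) -> float:
    """Standard binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    if not confidences:
        raise ValueError("at least one observation is required")

    conf_sums = [0.0] * num_bins
    hit_sums = [0.0] * num_bins
    counts = [0] * num_bins
    for c, o in zip(confidences, outcomes):
        # Equal-width bins over [0, 1]; c == 1.0 falls in the last bin.
        b = min(int(c * num_bins), num_bins - 1)
        conf_sums[b] += c
        hit_sums[b] += o
        counts[b] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        if counts[b] == 0:
            continue
        gap = abs(hit_sums[b] / counts[b] - conf_sums[b] / counts[b])
        ece += (counts[b] / total) * gap
    return ece


# A hypothetical agent that states 0.9 but is right 6 times out of 10.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
# The same outcomes with the confidence restated as 0.6.
recalibrated = expected_calibration_error([0.6] * 10, [1] * 6 + [0] * 4)
```

The function returns a value in $[0,1]$. The ECE figures quoted in this paper (for example 6–11 vs 37–63) appear to be on a 0–100 scale, so a comparison would multiply the result by 100.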

### 3.2 Confidence-Band Stratified Tracking

Rather than learning a single calibration correction per agent, MARGIN partitions the confidence range $[0,1]$ into $K$ disjoint bands $\{B_1, \ldots, B_K\}$ and maintains a separate calibration factor for each (agent, band) pair. This reflects the empirical observation that miscalibration is not uniform across confidence levels: a model may be well-calibrated when it expresses moderate confidence but severely overconfident at high confidence, or vice versa.

We use $K = 3$ equal-width bands by default:

$$B_1 = [0, \tfrac{1}{3}), \quad B_2 = [\tfrac{1}{3}, \tfrac{2}{3}), \quad B_3 = [\tfrac{2}{3}, 1]. \tag{2}$$

Let $\kappa(c) \in \{1, \ldots, K\}$ denote the band index for confidence value $c$. For each agent $a_i$ and band $k$, we maintain two running estimates:

- $\hat{a}_{i,k}$: the empirical accuracy rate of agent $a_i$ when its confidence falls in band $B_k$,
- $\bar{c}_{i,k}$: the mean confidence expressed by agent $a_i$ within band $B_k$.

Both are initialised to the band midpoint $m_k$, so that the initial calibration factor $\gamma_{i,k} = \hat{a}_{i,k} / \bar{c}_{i,k} = 1$ and raw confidence passes through unchanged before any observations.
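The band lookup and midpoint initialisation described above can be sketched as follows. This is an illustrative reading of the text, assuming equal-width bands; the names `band_index`, `band_midpoint`, `BandState`, and `initial_state` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


def band_index(confidence: float, num_bands: int = 3) -> int:
    """Return the 1-based index kappa(c) of the equal-width band containing c."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Bands are half-open on the right except the last, which includes 1.0.
    return min(int(confidence * num_bands), num_bands - 1) + 1


def band_midpoint(k: int, num_bands: int = 3) -> float:
    """Midpoint m_k of band k (1-based)."""
    return (k - 0.5) / num_bands


@dataclass
class BandState:
    """Running estimates for one (agent, band) pair."""
    accuracy: float    # a-hat_{i,k}
    confidence: float  # c-bar_{i,k}

    @property
    def gamma(self) -> float:
        # Calibration factor: observed accuracy over stated confidence.
        return self.accuracy / self.confidence


def initial_state(k: int, num_bands: int = 3) -> BandState:
    """Both estimates start at the band midpoint, so gamma starts at 1."""
    m = band_midpoint(k, num_bands)
    return BandState(accuracy=m, confidence=m)
```

With the default of three bands, a stated confidence of 0.9 maps to band 3, whose estimates both start at the midpoint 5/6.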

The choice of $K = 3$ balances calibration granularity against per-band sample requirements. Fewer bands accumulate observations faster and are more robust under severe distribution shift, where each band must re-learn its factor from limited data. More bands provide finer-grained correction when observations are plentiful. We evaluate this tradeoff empirically in Section [10.2](https://arxiv.org/html/2605.22949#S10.SS2).

### 3.3 EWMA Update Mechanism

Both running estimates are updated using an exponentially weighted moving average (EWMA) with a constant learning rate $\alpha \in (0,1)$. When agent $a_i$ produces a prediction at time $t$ with confidence $c_{i,t} \in B_k$ and outcome $o_{i,t}$ is subsequently observed, the updates are:

$$\hat{a}_{i,k} \leftarrow (1-\alpha)\,\hat{a}_{i,k} + \alpha\,o_{i,t}, \tag{3}$$

$$\bar{c}_{i,k} \leftarrow (1-\alpha)\,\bar{c}_{i,k} + \alpha\,c_{i,t}. \tag{4}$$

The constant learning rate gives exponentially decaying weights to past observations: the weight on an observation $\tau$ steps in the past is $(1-\alpha)^{\tau} \cdot \alpha$, with an effective memory window of approximately $1/\alpha$ observations. This is the key property that enables adaptation to distribution shift. Unlike a simple running average, which gives equal weight to all past observations, the EWMA automatically down-weights stale observations as the environment changes. The tradeoff is between tracking speed (higher $\alpha$, faster adaptation to shift) and estimation noise (lower $\alpha$, more stable estimates under stationarity). We formalise this in Proposition [4](https://arxiv.org/html/2605.22949#Thmtheorem4) and evaluate it empirically in Section [10.1](https://arxiv.org/html/2605.22949#S10.SS1).
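To make the weighting concrete, the sketch below checks that the recursive update equals an explicit sum with weights $\alpha(1-\alpha)^{\tau}$, and computes how much of the total weight falls inside a window of $1/\alpha$ observations. The function names are hypothetical, and the residual weight on the initial value is written out explicitly as an assumption about how initialisation enters the sum.

```python
def ewma_recursive(init, observations, alpha):
    """Apply the constant-rate EWMA update once per observation."""
    estimate = init
    for o in observations:
        estimate = (1 - alpha) * estimate + alpha * o
    return estimate


def ewma_explicit(init, observations, alpha):
    """Same estimate as a weighted sum: weight alpha * (1 - alpha)**tau on the
    observation tau steps in the past, plus the residual weight on the
    initial value."""
    n = len(observations)
    total = (1 - alpha) ** n * init
    for tau, o in enumerate(reversed(observations)):
        total += alpha * (1 - alpha) ** tau * o
    return total


def weight_within(alpha, window):
    """Total weight carried by the most recent `window` observations."""
    return 1 - (1 - alpha) ** window


alpha = 0.04
window = round(1 / alpha)                    # 25 observations
recent_mass = weight_within(alpha, window)   # roughly 0.64
```

With $\alpha = 0.04$, roughly 64% of the weight sits on the most recent 25 observations, which is one way to read the "effective memory window" of about $1/\alpha$.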

Symmetric updates. A natural alternative is to use asymmetric learning rates: a larger $\alpha_{\text{down}}$ when observed accuracy falls below the current estimate (penalising overconfidence faster) and a smaller $\alpha_{\text{up}}$ otherwise. This is appropriate when agents can strategically manipulate their confidence in response to the calibration signal. However, foundation model agents have fixed policies and do not adapt their confidence expression in response to external feedback. Their miscalibration is epistemic, not strategic: confidence errors are approximately zero-mean over time. Under this condition, symmetric EWMA is an unbiased estimator of the true accuracy rate, while asymmetric updates introduce systematic bias proportional to $|\alpha_{\text{up}} - \alpha_{\text{down}}|$ (Proposition [5](https://arxiv.org/html/2605.22949#Thmtheorem5)). We use $\alpha = 0.04$ throughout and confirm empirically in Section [10.3](https://arxiv.org/html/2605.22949#S10.SS3) that symmetric updates outperform all tested asymmetric configurations.
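The bias from asymmetric rates can be seen in a small simulation. This is a toy illustration under stated assumptions, not a reproduction of the paper's ablation: the true accuracy of 0.7, the rates 0.02 and 0.08, the stream length, and the seed are all hypothetical choices.

```python
import random


def run_ewma(outcomes, alpha_up, alpha_down, init=0.5):
    """Track a running accuracy estimate, returning the estimate after each step.

    alpha_down is applied when the outcome falls below the current estimate,
    alpha_up otherwise. Setting both equal gives the symmetric update.
    """
    estimate = init
    trace = []
    for o in outcomes:
        alpha = alpha_down if o < estimate else alpha_up
        estimate = (1 - alpha) * estimate + alpha * o
        trace.append(estimate)
    return trace


def tail_mean(trace):
    """Average of the second half of the trace, after the initial transient."""
    tail = trace[len(trace) // 2:]
    return sum(tail) / len(tail)


rng = random.Random(0)   # fixed seed so the illustration is repeatable
true_accuracy = 0.7      # hypothetical stationary accuracy
outcomes = [1 if rng.random() < true_accuracy else 0 for _ in range(20_000)]

symmetric = tail_mean(run_ewma(outcomes, alpha_up=0.04, alpha_down=0.04))
asymmetric = tail_mean(run_ewma(outcomes, alpha_up=0.02, alpha_down=0.08))
```

In this toy setting the symmetric estimate settles near the true rate, while the asymmetric estimate settles well below it. For binary outcomes with accuracy $p$, the expected update of this rule vanishes at $p\,\alpha_{\text{up}} / (p\,\alpha_{\text{up}} + (1-p)\,\alpha_{\text{down}})$, which is about 0.37 for the rates used here.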

The per\-band calibration factor is then:

$$\gamma_{i,k} = \frac{\hat{a}_{i,k}}{\bar{c}_{i,k}}, \tag{5}$$

representing the ratio of observed accuracy to stated confidence within the band. If an agent consistently achieves 60% accuracy when expressing 90% confidence, $\gamma_{i,k} \approx 0.67$, appropriately discounting future high-confidence predictions.
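The 60%-accuracy, 90%-confidence example can be traced with a short sketch of the per-band state. The class name `BandTracker` and the deterministic outcome pattern are assumptions made so the result is repeatable; a real task stream would be noisy.

```python
from dataclasses import dataclass


@dataclass
class BandTracker:
    """EWMA state for a single (agent, band) pair, following Eqs. (3)-(5)."""
    accuracy: float       # a-hat
    confidence: float     # c-bar
    alpha: float = 0.04   # default learning rate from the text

    def update(self, stated_confidence: float, outcome: int) -> None:
        self.accuracy = (1 - self.alpha) * self.accuracy + self.alpha * outcome
        self.confidence = (
            (1 - self.alpha) * self.confidence + self.alpha * stated_confidence
        )

    @property
    def gamma(self) -> float:
        return self.accuracy / self.confidence


# Top band [2/3, 1]: both estimates start at the midpoint 5/6, so gamma starts at 1.
tracker = BandTracker(accuracy=5 / 6, confidence=5 / 6)
initial_gamma = tracker.gamma

# Hypothetical agent: always states 0.9, correct 3 times in every 5 tasks.
pattern = [1, 0, 1, 0, 1]
for step in range(500):
    tracker.update(stated_confidence=0.9, outcome=pattern[step % 5])

final_gamma = tracker.gamma  # close to 0.6 / 0.9
```

After a few multiples of $1/\alpha$ observations the influence of the midpoint initialisation has decayed and the factor sits close to $0.6 / 0.9 \approx 0.67$.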

### 3.4 Bayesian Shrinkage Blending

The per-band calibration factor $\gamma_{i,k}$ is the most informative estimate when sufficient observations have accumulated in band $k$. Early in the observation stream, however, some (agent, band) pairs may have very few observations, leading to high-variance estimates. We address this with a hierarchical shrinkage scheme that blends the band-level factor toward a more stable model-level prior.

Let $\gamma_{i,\cdot}$ denote the model-level calibration factor for agent $a_i$, computed from the EWMA accuracy and confidence estimates aggregated across all bands. Let $n_{i,k}$ denote the number of observations accumulated for agent $a_i$ in band $k$. The effective calibration factor is:

$$\gamma_{i,k}^{\text{eff}} = \frac{n_{i,k}}{n_{i,k} + k_s}\,\gamma_{i,k} + \frac{k_s}{n_{i,k} + k_s}\,\gamma_{i,\cdot}, \tag{6}$$

where $k_s > 0$ is the shrinkage constant controlling the blending rate. When $n_{i,k}$ is small relative to $k_s$, the estimate is pulled toward the model-level factor. As observations accumulate, the band-level factor dominates.

This is an empirical Bayes construction: the model-level factor serves as a prior informed by the agent’s overall calibration quality, and the band-level factor is the likelihood contribution from within-band observations. The blending weight $n_{i,k} / (n_{i,k} + k_s)$ increases monotonically with observations, ensuring that the effective estimate improves over time regardless of the initial band allocation.

We use $k_s = 100$ as the default, which provides meaningful shrinkage during the first $\sim$100 observations per band while converging to the pure band-level factor thereafter. The no-blending case ($k_s = 0$) already outperforms all design-time baselines; shrinkage provides an additional 40% ECE reduction in the moderate-to-severe shift regime that dominates practical deployment (Section [10.4](https://arxiv.org/html/2605.22949#S10.SS4)).
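The blending rule in Eq. (6) can be written as a small function. This sketch takes the model-level factor as an input because the text does not spell out how the across-band aggregation is computed; the function name `effective_gamma`, the example factors 0.4 and 0.8, and the handling of the $n = 0$, $k_s = 0$ corner case are assumptions.

```python
def effective_gamma(gamma_band: float, gamma_model: float,
                    n_band: int, k_s: float = 100.0) -> float:
    """Blend band-level and model-level calibration factors as in Eq. (6)."""
    if n_band < 0 or k_s < 0:
        raise ValueError("n_band and k_s must be non-negative")
    if n_band + k_s == 0:
        # No observations and no shrinkage: fall back to the band-level factor.
        return gamma_band
    weight = n_band / (n_band + k_s)
    return weight * gamma_band + (1 - weight) * gamma_model


# Hypothetical factors: the band looks much worse than the agent overall.
cold = effective_gamma(gamma_band=0.4, gamma_model=0.8, n_band=0)       # 0.8
half = effective_gamma(gamma_band=0.4, gamma_model=0.8, n_band=100)     # about 0.6
warm = effective_gamma(gamma_band=0.4, gamma_model=0.8, n_band=10_000)  # about 0.404
```

With the default $k_s = 100$, a band with no observations uses the model-level factor, a band with 100 observations weights the two equally, and the band-level factor dominates only once $n_{i,k}$ is large relative to $k_s$.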

### 3.5 Calibrated Confidence Weighting

Given the effective calibration factor, the calibrated confidence for agent $a_i$ at time $t$ is:

$$\tilde{c}_{i,t} = \gamma_{i,\kappa(c_{i,t})}^{\text{eff}} \cdot c_{i,t}. \tag{7}$$

For multi-agent selection, we aggregate across the responding pool using confidence-weighted voting. For each candidate answer $y$ in the response set, the aggregated score is:

$$s(y) = \sum_{i \in \mathcal{A}_t} \tilde{c}_{i,t} \cdot \mathbf{1}[\hat{y}_{i,t} = y], \tag{8}$$

and the selected answer is $\hat{y}_t = \arg\max_{y}\, s(y)$.

This rule has a clear interpretation: each agent’s vote is weighted by its calibrated confidence, so that a highly confident but historically unreliable agent contributes less than a moderately confident but well\-calibrated one\. When calibration is poor \(as with raw confidence\), overconfident agents dominate the vote and the selection can perform worse than random\. When calibration is accurate, the weighting naturally favours the agent most likely to be correct\.
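The voting rule of Eqs. (7) and (8) can be illustrated with a short sketch. The function name `select_answer`, the tuple layout of the input, and the tie-breaking behaviour (first answer seen wins) are assumptions for this example, not details given in the text.

```python
from collections import defaultdict


def select_answer(responses):
    """Confidence-weighted vote.

    responses: iterable of (answer, raw_confidence, effective_gamma) tuples,
    one per responding agent. Returns (selected_answer, scores_by_answer).
    """
    scores = defaultdict(float)
    for answer, confidence, gamma_eff in responses:
        calibrated = gamma_eff * confidence   # Eq. (7)
        scores[answer] += calibrated          # Eq. (8)
    # ASSUMPTION: ties are broken in favour of the answer seen first.
    best = max(scores, key=scores.get)
    return best, dict(scores)


# An overconfident but historically unreliable agent (factor 0.4) is
# outvoted by a moderately confident, well-calibrated one (factor 1.0).
responses = [("B", 0.95, 0.4), ("A", 0.70, 1.0)]
best, scores = select_answer(responses)
```

With raw confidence alone the same pool would select "B", since 0.95 exceeds 0.70.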

### 3.6 Dual Modality

MARGIN is agnostic to the source of the confidence signal. We evaluate two modalities:

Verbalized confidence. The agent is prompted to state its confidence as a numerical value alongside its prediction. This is the most broadly available signal, requiring no special infrastructure beyond a prompt template. However, verbalized confidence is known to be poorly calibrated: foundation models tend toward overconfidence, and the mapping from internal uncertainty to a stated number is unreliable [[33](https://arxiv.org/html/2605.22949#bib.bib12)].

Consistency confidence. The same query is presented to the agent $M$ times with non-zero temperature, and the confidence is computed as the fraction of runs producing the same answer as the first run:

$$c_{i,t}^{\text{cons}}=\frac{1}{M}\sum_{m=1}^{M}\mathbf{1}[\hat{y}_{i,t}^{(m)}=\hat{y}_{i,t}^{(1)}]. \tag{9}$$

Consistency confidence is more expensive ($M\times$ the inference cost) but provides a behavioural measure of uncertainty that does not depend on the model's ability to introspect.

MARGIN applies the same per-band EWMA calibration to both modalities independently. No modality-specific tuning is required: the same $\alpha$, band count, and shrinkage constant are used throughout. We compare both modalities in the experimental evaluation and find that consistency confidence provides a stronger base signal, but MARGIN improves both substantially.
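Eq. (9) is simple enough to state directly in code. The sketch below assumes the $M$ sampled answers have already been collected and normalised to comparable values; the function name is hypothetical.

```python
def consistency_confidence(answers):
    """Fraction of the M sampled answers that match the first sample, Eq. (9)."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    reference = answers[0]
    return sum(answer == reference for answer in answers) / len(answers)


# Four samples, three of which agree with the first one.
conf = consistency_confidence(["42", "42", "41", "42"])
```

Because the first sample always matches itself, this estimate is bounded below by $1/M$ and takes only $M$ distinct values, which is worth keeping in mind when choosing band boundaries.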

### 3.7 Summary

Algorithm [1](https://arxiv.org/html/2605.22949#alg1) gives the complete procedure, and Figure [1](https://arxiv.org/html/2605.22949#S3.F1) summarises the pipeline. MARGIN has three hyperparameters: the learning rate $\alpha$, the number of confidence bands $K$, and the shrinkage constant $k_s$. We use $\alpha=0.04$, $K=3$, and $k_s=100$ as defaults throughout all experiments. Sensitivity to each is evaluated in Section [10](https://arxiv.org/html/2605.22949#S10).

![Refer to caption](https://arxiv.org/html/2605.22949v1/x1.png)

Figure 1: MARGIN pipeline. Each agent's raw confidence $c_{i,t}$ is routed to one of $K=3$ confidence bands. Each band maintains an independent symmetric EWMA of agent accuracy (learning rate $\alpha=0.04$). Bayesian shrinkage ($k_s=100$) blends the band estimate with a global prior during the cold-start period to produce the calibrated confidence $\tilde{c}_{i,t}$, which drives multi-agent selection. After the outcome $y_{i,t}$ is observed, only the band used at time $t$ is updated (dashed feedback).

Algorithm 1 MARGIN: Online Confidence Calibration and Multi-Agent Selection

1: Agent pool $\mathcal{A}$, bands $\{B_1,\ldots,B_K\}$, learning rate $\alpha$, shrinkage $k_s$

2: Initialise: for all $i,k$: $\hat{a}_{i,k}\leftarrow m_k$, $\bar{c}_{i,k}\leftarrow m_k$, $n_{i,k}\leftarrow 0$

3: for each task $q_t$ do

4: for each responding agent $a_i\in\mathcal{A}_t$ do

5: Receive prediction $\hat{y}_{i,t}$ and confidence $c_{i,t}$

6: $k\leftarrow\kappa(c_{i,t})$ $\triangleright$ Determine confidence band

7: Compute $\gamma_{i,k}^{\text{eff}}$ via Eq. ([6](https://arxiv.org/html/2605.22949#S3.E6))

8: $\tilde{c}_{i,t}\leftarrow\gamma_{i,k}^{\text{eff}}\cdot c_{i,t}$ $\triangleright$ Calibrated confidence

9: end for

10: $\hat{y}_t\leftarrow\arg\max_y\sum_{i\in\mathcal{A}_t}\tilde{c}_{i,t}\cdot\mathbf{1}[\hat{y}_{i,t}=y]$ $\triangleright$ Select answer

11: Wait for ground truth $y_t$

12: for each responding agent $a_i\in\mathcal{A}_t$ do

13: $o_{i,t}\leftarrow\mathbf{1}[\hat{y}_{i,t}=y_t]$

14: $k\leftarrow\kappa(c_{i,t})$

15: $\hat{a}_{i,k}\leftarrow(1-\alpha)\,\hat{a}_{i,k}+\alpha\,o_{i,t}$ $\triangleright$ Update accuracy

16: $\bar{c}_{i,k}\leftarrow(1-\alpha)\,\bar{c}_{i,k}+\alpha\,c_{i,t}$ $\triangleright$ Update confidence

17: $n_{i,k}\leftarrow n_{i,k}+1$

18: end for

19: end for
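The loop of Algorithm 1 can be sketched as a small Python class. Several details are not fixed by this section and are assumptions made here for illustration: bands are taken to be equal-width on $[0,1]$, $m_k$ is taken to be the band midpoint, the calibration factor is taken to be the ratio of EWMA accuracy to EWMA confidence, and the model-level factor is tracked with a separate EWMA pair (initialised at 0.5) over all of an agent's observations. The class and method names are hypothetical.

```python
from collections import defaultdict


class Margin:
    """Illustrative sketch of Algorithm 1 (not the reference implementation)."""

    def __init__(self, alpha=0.04, num_bands=3, k_s=100.0):
        self.alpha, self.num_bands, self.k_s = alpha, num_bands, k_s
        self.band = {}    # (agent, k) -> [EWMA accuracy, EWMA confidence, count]
        self.model = {}   # agent -> [EWMA accuracy, EWMA confidence]

    def band_of(self, c):
        # ASSUMPTION: equal-width bands on [0, 1]; c = 1.0 maps to the top band.
        return min(int(c * self.num_bands), self.num_bands - 1)

    def _band_state(self, agent, k):
        m_k = (k + 0.5) / self.num_bands   # ASSUMPTION: m_k is the band midpoint
        return self.band.setdefault((agent, k), [m_k, m_k, 0])

    def _model_state(self, agent):
        return self.model.setdefault(agent, [0.5, 0.5])

    def calibrate(self, agent, c):
        """Steps 6-8: band lookup, Eq. (6) blend, calibrated confidence."""
        a_hat, c_bar, n = self._band_state(agent, self.band_of(c))
        a_all, c_all = self._model_state(agent)
        gamma_band = a_hat / c_bar    # ASSUMPTION: factor = accuracy / confidence
        gamma_model = a_all / c_all
        w = n / (n + self.k_s) if n + self.k_s > 0 else 0.0
        return (w * gamma_band + (1 - w) * gamma_model) * c

    def select(self, responses):
        """Step 10. responses: dict agent -> (prediction, confidence)."""
        scores = defaultdict(float)
        for agent, (prediction, c) in responses.items():
            scores[prediction] += self.calibrate(agent, c)
        return max(scores, key=scores.get)

    def update(self, responses, truth):
        """Steps 12-18: symmetric EWMA updates for the band that was used."""
        a = self.alpha
        for agent, (prediction, c) in responses.items():
            outcome = 1.0 if prediction == truth else 0.0
            for state in (self._band_state(agent, self.band_of(c)),
                          self._model_state(agent)):
                state[0] = (1 - a) * state[0] + a * outcome
                state[1] = (1 - a) * state[1] + a * c
            self._band_state(agent, self.band_of(c))[2] += 1


# Toy stream: "bad" is always confident and always wrong, "good" is
# moderately confident and always right.
margin = Margin()
responses = {"bad": ("x", 0.95), "good": ("y", 0.70)}
before = margin.select(responses)
for _ in range(50):
    margin.update(responses, truth="y")
after = margin.select(responses)
```

In this toy stream the coordinator initially follows the louder agent and switches to the reliable one once the factors have been learned.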

## 4 Formal Properties

We state six propositions characterising the behaviour of MARGIN. Proof sketches appear below; full proofs are in Appendix [A](https://arxiv.org/html/2605.22949#A1). Throughout, we consider a single (agent, band) pair and drop the subscripts $i,k$ for clarity. Let $X_1,X_2,\ldots$ be binary outcomes with $\mathbb{E}[X_t]=\theta$, and let $\hat{a}_0\in[0,1]$ be an arbitrary initialisation.

###### Proposition 1 (EWMA as Exponential Discounting).

The EWMA update $\hat{a}_t=(1-\alpha)\,\hat{a}_{t-1}+\alpha\,X_t$ admits the closed-form expansion

$$\hat{a}_t=(1-\alpha)^t\,\hat{a}_0+\alpha\sum_{\tau=1}^{t}(1-\alpha)^{t-\tau}\,X_\tau. \tag{10}$$

The weight assigned to observation $X_\tau$ is $w_\tau=\alpha(1-\alpha)^{t-\tau}$, which decays exponentially with age. The weights sum to $1-(1-\alpha)^t$, with the residual $(1-\alpha)^t$ carried by the initialisation. The effective memory window is approximately $1/\alpha$ observations.

Proof sketch. Direct induction on the recurrence. The weight sum is a geometric series. The effective window follows from $(1-\alpha)^{1/\alpha}\to e^{-1}$, so observations older than $1/\alpha$ steps contribute less than $e^{-1}$ of the most recent observation's weight. ∎
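The equivalence between the recurrence and the closed form of Eq. (10) is easy to check numerically. The sketch below uses a synthetic Bernoulli stream with an arbitrary seed; the function names and the stream parameters are illustrative choices only.

```python
import random


def ewma_recursive(xs, alpha, a0):
    """Apply the EWMA update one observation at a time."""
    a = a0
    for x in xs:
        a = (1 - alpha) * a + alpha * x
    return a


def ewma_closed_form(xs, alpha, a0):
    """Evaluate Eq. (10) directly from the full history."""
    t = len(xs)
    weighted = sum(alpha * (1 - alpha) ** (t - tau) * x
                   for tau, x in enumerate(xs, start=1))
    return (1 - alpha) ** t * a0 + weighted


rng = random.Random(0)
xs = [1.0 if rng.random() < 0.8 else 0.0 for _ in range(200)]
alpha, a0 = 0.04, 0.5

recursive = ewma_recursive(xs, alpha, a0)
closed = ewma_closed_form(xs, alpha, a0)

# The observation weights sum to 1 - (1 - alpha)^t, as stated above.
weight_sum = sum(alpha * (1 - alpha) ** (len(xs) - tau)
                 for tau in range(1, len(xs) + 1))
```

The recursive form needs constant memory per (agent, band) pair, which is why the online procedure never has to store the observation history.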

###### Proposition 2 (Calibration Convergence).

Under i.i.d. outcomes $X_t\sim\mathrm{Bernoulli}(\theta)$, the EWMA estimator is asymptotically unbiased and converges to a stationary distribution:

$$\mathbb{E}[\hat{a}_t]=\theta+(1-\alpha)^t\,(\hat{a}_0-\theta), \tag{11}$$

$$\mathrm{Var}(\hat{a}_t)=\frac{\alpha}{2-\alpha}\,\theta(1-\theta)\,\bigl[1-(1-\alpha)^{2t}\bigr]. \tag{12}$$

In the limit, $\mathbb{E}[\hat{a}_t]\to\theta$ and $\mathrm{Var}(\hat{a}_t)\to\frac{\alpha}{2-\alpha}\,\theta(1-\theta)$.

Proof sketch. Apply linearity of expectation to Eq. ([10](https://arxiv.org/html/2605.22949#S4.E10)) to obtain ([11](https://arxiv.org/html/2605.22949#S4.E11)). For variance, use independence of the $X_\tau$ to write $\mathrm{Var}(\hat{a}_t)=\theta(1-\theta)\sum_{\tau=1}^{t}w_\tau^2$ and evaluate the geometric sum of squared weights. The bias $(1-\alpha)^t(\hat{a}_0-\theta)$ decays geometrically: at $\alpha=0.04$, the bias factor is $0.96^{50}\approx 0.13$ after 50 observations and $0.96^{100}\approx 0.017$ after 100. The steady-state standard deviation $\sqrt{\alpha\,\theta(1-\theta)/(2-\alpha)}$ is irreducible and is the price of adaptability. ∎

Empirical illustration. For $\alpha=0.04$ and $\theta=0.79$ (the observed mean accuracy in the high-confidence band), the predicted steady-state standard deviation is $\sqrt{0.04\times 0.79\times 0.21/1.96}\approx 0.058$. Our bootstrap convergence analysis across 100 question orderings shows an asymptotic standard deviation exceeding 0.05 in this band, consistent with the theoretical prediction.
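The numbers quoted in the proof sketch and the empirical illustration follow directly from Eqs. (11) and (12). The short sketch below recomputes them and cross-checks the closed-form variance against the explicit sum of squared weights; the function name is hypothetical.

```python
import math


def ewma_variance(t, alpha, theta):
    """Closed-form variance after t observations, Eq. (12)."""
    return (alpha / (2 - alpha)) * theta * (1 - theta) * (1 - (1 - alpha) ** (2 * t))


alpha, theta = 0.04, 0.79

# Geometric decay of the initialisation bias, Eq. (11).
bias_factor_50 = (1 - alpha) ** 50
bias_factor_100 = (1 - alpha) ** 100

# Steady-state standard deviation for the high-confidence band example.
steady_std = math.sqrt(alpha * theta * (1 - theta) / (2 - alpha))

# Cross-check Eq. (12) against theta(1 - theta) * sum of squared weights.
t = 30
variance_from_weights = theta * (1 - theta) * sum(
    (alpha * (1 - alpha) ** (t - tau)) ** 2 for tau in range(1, t + 1))
variance_closed_form = ewma_variance(t, alpha, theta)
```

Both routes to the variance agree to floating-point precision, which confirms that the $\alpha/(2-\alpha)$ prefactor is the geometric sum of the squared weights.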

###### Proposition 3 (Tracking Speed).

Suppose the true accuracy rate shifts instantaneously from $\theta$ to $\theta'=\theta+\Delta$ at time $t_0$. After $n$ subsequent observations drawn i.i.d. from $\mathrm{Bernoulli}(\theta')$, the expected bias of the estimator relative to $\theta'$ is

$$|\mathbb{E}[\hat{a}_{t_0+n}]-\theta'|=(1-\alpha)^n\,|\hat{a}_{t_0}-\theta'|. \tag{13}$$

To reduce this bias below $\varepsilon$, the required number of observations is

$$n\geq\frac{1}{\alpha}\,\ln\!\left(\frac{|\hat{a}_{t_0}-\theta'|}{\varepsilon}\right)+O(1). \tag{14}$$

Proof sketch. After the shift, the system is equivalent to a fresh EWMA initialised at $\hat{a}_{t_0}$ tracking $\theta'$. Apply Eq. ([11](https://arxiv.org/html/2605.22949#S4.E11)) with $\hat{a}_0=\hat{a}_{t_0}$. The logarithmic inversion uses $(1-\alpha)^n\leq e^{-\alpha n}$. This formalises why MARGIN adapts to distribution shift while design-time methods, which have no forgetting mechanism, remain permanently miscalibrated after the shift. ∎
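To make Eqs. (13) and (14) concrete, the sketch below computes both the exact number of observations needed for the expected bias to fall below a tolerance and the simpler logarithmic bound. The function names and the example gap of 0.3 with tolerance 0.05 are illustrative assumptions.

```python
import math


def exact_recovery_steps(initial_gap, eps, alpha):
    """Smallest n with (1 - alpha)^n * initial_gap <= eps, from Eq. (13)."""
    if initial_gap <= eps:
        return 0
    return math.ceil(math.log(eps / initial_gap) / math.log(1 - alpha))


def bound_recovery_steps(initial_gap, eps, alpha):
    """Sufficient n from the bound (1 - alpha)^n <= exp(-alpha * n)."""
    if initial_gap <= eps:
        return 0
    return math.ceil(math.log(initial_gap / eps) / alpha)


alpha = 0.04
gap, eps = 0.3, 0.05    # hypothetical post-shift gap and target tolerance
n_exact = exact_recovery_steps(gap, eps, alpha)
n_bound = bound_recovery_steps(gap, eps, alpha)
```

The bound is slightly conservative, and both counts scale as $1/\alpha$: doubling the learning rate roughly halves the recovery time after an abrupt shift.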

###### Proposition 4 (Bias-Variance Tradeoff).

Consider a slowly drifting environment where the true accuracy rate changes at rate $\delta$ per observation. The expected steady-state calibration error decomposes as

$$\mathbb{E}[\,|\hat{a}_t-\theta_t|\,]\;\leq\;\underbrace{\frac{\delta}{2\alpha}}_{\text{tracking lag}}\;+\;\underbrace{\sqrt{\frac{\alpha\,\theta(1-\theta)}{2-\alpha}}}_{\text{estimation noise}}. \tag{15}$$

The tracking lag decreases with $\alpha$ (faster adaptation), while the estimation noise increases with $\alpha$ (shorter effective window). The minimum defines an optimal $\alpha^{*}$ for a given drift rate $\delta$.

Proof sketch. The tracking lag arises because the EWMA lags behind a drifting target by approximately the drift accumulated over the effective window: $\delta\times\frac{1}{2\alpha}$. The estimation noise is the steady-state standard deviation from Proposition [2](https://arxiv.org/html/2605.22949#Thmtheorem2). The bound follows from the triangle inequality applied to the bias and noise components. Minimising over $\alpha$ gives $\alpha^{*}=O(\delta^{2/3})$, but in practice the U-shape is broad and $\alpha=0.04$ works well across a range of drift rates (Section [10.1](https://arxiv.org/html/2605.22949#S10.SS1)). ∎
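The $\delta^{2/3}$ scaling can be checked numerically by minimising the right-hand side of Eq. (15) over a grid and comparing against the stationary point obtained when $2-\alpha$ is approximated by 2, which gives $\alpha^{*}\approx\bigl(\delta/\sqrt{\theta(1-\theta)/2}\bigr)^{2/3}$. The grid range, the grid resolution, and the example drift rates below are arbitrary choices for illustration.

```python
import math


def error_bound(alpha, delta, theta):
    """Right-hand side of Eq. (15): tracking lag plus estimation noise."""
    lag = delta / (2 * alpha)
    noise = math.sqrt(alpha * theta * (1 - theta) / (2 - alpha))
    return lag + noise


def best_alpha_grid(delta, theta, points=4000, lo=1e-4, hi=0.5):
    """Minimise the bound over a log-spaced grid of learning rates."""
    grid = [lo * (hi / lo) ** (i / (points - 1)) for i in range(points)]
    return min(grid, key=lambda a: error_bound(a, delta, theta))


def best_alpha_approx(delta, theta):
    """Stationary point of the bound after approximating 2 - alpha by 2."""
    return (delta / math.sqrt(theta * (1 - theta) / 2)) ** (2 / 3)


theta = 0.8
slow = best_alpha_grid(1e-4, theta)
fast = best_alpha_grid(8e-4, theta)
ratio = best_alpha_approx(8e-4, theta) / best_alpha_approx(1e-4, theta)
```

An eightfold increase in drift rate moves the minimiser by a factor of about four, consistent with the $2/3$ exponent.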

###### Proposition 5 (Symmetric Optimality for Non-Strategic Agents).

Consider an asymmetric EWMA with learning rates $\alpha_{\mathrm{up}}$ (when $X_t=1$) and $\alpha_{\mathrm{down}}$ (when $X_t=0$). Under i.i.d. $\mathrm{Bernoulli}(\theta)$ outcomes, the steady-state expectation is

$$\mathbb{E}[\hat{a}_\infty]=\frac{\alpha_{\mathrm{up}}\,\theta}{\alpha_{\mathrm{up}}\,\theta+\alpha_{\mathrm{down}}\,(1-\theta)}. \tag{16}$$

This equals $\theta$ if and only if $\alpha_{\mathrm{up}}=\alpha_{\mathrm{down}}$. Otherwise, the asymptotic bias is

$$|\mathbb{E}[\hat{a}_\infty]-\theta|=\frac{\theta(1-\theta)\,|\alpha_{\mathrm{up}}-\alpha_{\mathrm{down}}|}{\alpha_{\mathrm{up}}\,\theta+\alpha_{\mathrm{down}}\,(1-\theta)}. \tag{17}$$

For agents with fixed policies whose confidence errors are epistemic (zero-mean), symmetric EWMA is the minimum-variance unbiased estimator within the EWMA family.

Proof sketch. The asymmetric EWMA defines a Markov chain on $[0,1]$ with state-dependent transition rates. The steady state satisfies $\mathbb{E}[\hat{a}_\infty]=(1-\alpha_{\mathrm{up}})\,\mathbb{E}[\hat{a}_\infty\mid X=1]\,\theta+(1-\alpha_{\mathrm{down}})\,\mathbb{E}[\hat{a}_\infty\mid X=0]\,(1-\theta)+\alpha_{\mathrm{up}}\,\theta$. Solving for the fixed point yields Eq. ([16](https://arxiv.org/html/2605.22949#S4.E16)). The bias expression follows algebraically.

As a numerical example: for $\theta=0.8$, $\alpha_{\mathrm{up}}=0.02$, $\alpha_{\mathrm{down}}=0.06$, the estimator converges to $0.016/0.028\approx 0.571$ rather than $0.80$. This predicts the severe ECE degradation (3–4$\times$ worse) observed for all asymmetric configurations in Section [10.3](https://arxiv.org/html/2605.22949#S10.SS3).

The argument for unbiasedness as optimality: foundation model agents have fixed policies and do not strategically adjust their confidence in response to calibration feedback. Their miscalibration errors are epistemic, arising from the gap between internal representations and true task difficulty. Under this condition, errors are approximately zero-mean, and any biased estimator systematically over- or under-corrects, increasing ECE. Symmetric EWMA is the unique unbiased member of the EWMA family. ∎
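The fixed point of Eq. (16) can be verified by iterating the expectation of the asymmetric update until it stops changing. The sketch assumes the update is $\hat{a}\leftarrow(1-\alpha_{\mathrm{up}})\hat{a}+\alpha_{\mathrm{up}}$ on a correct outcome and $\hat{a}\leftarrow(1-\alpha_{\mathrm{down}})\hat{a}$ on an incorrect one, which is the reading implied by the steady-state equation in the proof sketch; the function names are hypothetical.

```python
def asymmetric_steady_state(theta, alpha_up, alpha_down):
    """Closed-form steady-state expectation, Eq. (16)."""
    return alpha_up * theta / (alpha_up * theta + alpha_down * (1 - theta))


def iterate_expectation(theta, alpha_up, alpha_down, a0=0.5, steps=5000):
    """Iterate E[a_t] under i.i.d. Bernoulli(theta) outcomes.

    Each outcome is independent of the previous estimate, so the
    expectation obeys a linear recursion.
    """
    e = a0
    for _ in range(steps):
        e = (theta * ((1 - alpha_up) * e + alpha_up)
             + (1 - theta) * (1 - alpha_down) * e)
    return e


theta, alpha_up, alpha_down = 0.8, 0.02, 0.06
closed = asymmetric_steady_state(theta, alpha_up, alpha_down)
iterated = iterate_expectation(theta, alpha_up, alpha_down)
bias = abs(closed - theta)
symmetric = asymmetric_steady_state(theta, 0.04, 0.04)
```

Both routes give roughly 0.571 for the asymmetric configuration in the numerical example, while equal learning rates recover $\theta$ exactly.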

###### Proposition 6 (Selection Monotonicity).

Consider $N$ agents with true per-task accuracies $p_1>p_2\geq\cdots\geq p_N$, each expressing confidence $c_i=p_i+\eta_i$ where $\eta_i$ is zero-mean noise. Under confidence-weighted selection (Eq. [8](https://arxiv.org/html/2605.22949#S3.E8)) with a single task, the probability of selecting the best agent increases monotonically as calibration error decreases. Formally, let $\sigma^2$ denote the variance of the calibration noise (the residual error in $\tilde{c}_{i,t}$ relative to $p_i$). Then

$$\frac{\partial}{\partial\sigma^2}\mathbb{P}\!\left[\hat{y}_t=\hat{y}_{1,t}\right]\leq 0. \tag{18}$$

In particular, when raw confidence is anti-correlated with true accuracy (as observed empirically on hard benchmarks), selection performs below random, and any calibration that reduces ECE must improve selection above this baseline.

Proof sketch. The selection rule chooses the agent with the highest calibrated confidence. When calibration noise $\sigma^2\to 0$, the highest-confidence agent is the most accurate with probability approaching 1. As $\sigma^2$ increases, the selection becomes increasingly random. The monotonicity follows from standard results on the probability that the maximum of correlated Gaussians corresponds to the highest-mean component.

The "worse than random" pattern on hard benchmarks (pairwise resolution 45–56% with raw confidence) occurs because foundation models are systematically more confident on problems they get wrong: weaker models express higher verbalized confidence on challenging tasks. This creates negative correlation between confidence and accuracy, making confidence-weighted selection actively harmful. Any calibration method that corrects this negative correlation, reducing ECE and restoring the confidence–accuracy alignment, must improve selection above 50%. ∎
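The monotonicity claim can be illustrated with a small Monte Carlo experiment. The sketch below assumes independent Gaussian noise on each agent's calibrated confidence and that every agent proposes a distinct answer, so that weighted voting reduces to picking the highest confidence; the accuracies, noise levels, trial count, and seed are arbitrary illustrative choices.

```python
import random


def prob_select_best(accuracies, sigma, trials=20000, seed=0):
    """Estimate P[highest noisy confidence belongs to the most accurate agent]."""
    rng = random.Random(seed)
    best = max(range(len(accuracies)), key=accuracies.__getitem__)
    hits = 0
    for _ in range(trials):
        # c_i = p_i + eta_i with eta_i ~ N(0, sigma^2), independent across agents.
        noisy = [p + rng.gauss(0.0, sigma) for p in accuracies]
        if max(range(len(noisy)), key=noisy.__getitem__) == best:
            hits += 1
    return hits / trials


accuracies = [0.8, 0.7, 0.6]
p_low_noise = prob_select_best(accuracies, sigma=0.01)
p_mid_noise = prob_select_best(accuracies, sigma=0.1)
p_high_noise = prob_select_best(accuracies, sigma=1.0)
```

As the noise grows the selection probability falls towards the uniform rate of one in three. Zero-mean noise alone cannot push it below that rate, which is why the below-random behaviour discussed above requires confidence that is anti-correlated with accuracy.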

## 5 Experimental Setup

### 5.1 Models

We evaluate 19 foundation models spanning diverse architectures, scales, and access modes (Table [2](https://arxiv.org/html/2605.22949#S5.T2)). Ten models are accessed via cloud API (Qwen, DeepSeek, GPT, MiniMax, GLM families), and nine are run locally via Ollama at 4-bit quantisation (7B–72B parameters). This mix ensures that MARGIN is tested across the heterogeneity typical of real multi-agent deployments: different providers, architectures, quantisation levels, and inference regimes.

Table 2: Model inventory. Cloud models accessed via API; local models run via Ollama (Q4_K_M quantisation).

| Model | Family | Parameters | Access |
| --- | --- | --- | --- |
| Qwen3-Coder-480B-A35B | Qwen 3 | 480B (35B active) | Cloud |
| Qwen3-32B | Qwen 3 | 32B | Cloud |
| Qwen2.5-32B-Instruct | Qwen 2.5 | 32B | Cloud |
| Qwen2.5-14B-Instruct-1M | Qwen 2.5 | 14B | Cloud |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 32B | Cloud |
| DeepSeek-V3.2 | DeepSeek | MoE | Cloud |
| gpt-oss-120b | GPT | 120B | Cloud |
| MiniMax | MiniMax | - | Cloud |
| GLM-4.7-Flash | GLM | - | Cloud |
| Qwen3-235B | Qwen 3 | 235B | Cloud |
| qwen2.5:72b | Qwen 2.5 | 72B | Local |
| llama3.1:70b | LLaMA 3.1 | 70B | Local |
| command-r:35b | Command R | 35B | Local |
| gemma2:27b | Gemma 2 | 27B | Local |
| qwen2.5:14b | Qwen 2.5 | 14B | Local |
| phi4:14b | Phi-4 | 14B | Local |
| gemma2:9b | Gemma 2 | 9B | Local |
| llama3.1:8b | LLaMA 3.1 | 8B | Local |
| mistral:7b | Mistral | 7B | Local |

### 5.2 Benchmarks

We evaluate on eight benchmarks spanning code generation, question answering, and mathematics (Table [3](https://arxiv.org/html/2605.22949#S5.T3)). Code generation benchmarks provide deterministic ground truth via execution, making them ideal for calibration evaluation. The QA and mathematics benchmarks extend coverage to domains with different difficulty distributions.

For distribution shift experiments, we pair an easy benchmark \(phase 1, calibration source\) with a harder benchmark \(phase 2, evaluation target\)\. MARGIN learns calibration factors online from the phase 1 stream, then continues learning on phase 2 without resetting\. Baselines are fitted on phase 1 data only, using a proper 50/50 calibration/evaluation split with 100 random shuffles\.

Table 3: Benchmark summary. Shift pairs are indicated by arrows.

### 5.3 Baselines

We compare MARGIN against four baselines:

- Raw: Uncalibrated confidence, as stated by the model.
- Temperature scaling [[12](https://arxiv.org/html/2605.22949#bib.bib1)]: A single scalar temperature $T$ fitted to minimise negative log-likelihood on calibration data. Applied to confidence scores post-hoc.
- Platt scaling [[26](https://arxiv.org/html/2605.22949#bib.bib2)]: A logistic regression mapping confidence to calibrated probability. Two parameters fitted on calibration data.
- Histogram binning: Non-parametric calibration that replaces each bin's mean confidence with its observed accuracy. Fitted on calibration data.

All baselines are fitted using a proper 50/50 calibration/evaluation split on phase 1 data\. We repeat with 100 random shuffles and report means\. This is the most favourable possible setup for design\-time methods: they see calibration data from the same distribution as evaluation\. In deployment, this assumption rarely holds\.
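For reference, the histogram binning baseline can be sketched as follows. The bin count of 10, the midpoint fallback for empty bins, and the function names are assumptions for this illustration; the text does not specify how the baseline was configured beyond the description above.

```python
def fit_histogram_binning(confidences, outcomes, num_bins=10):
    """Return per-bin observed accuracy fitted on calibration data."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for c, o in zip(confidences, outcomes):
        b = min(int(c * num_bins), num_bins - 1)   # equal-width bins on [0, 1]
        sums[b] += o
        counts[b] += 1
    # ASSUMPTION: bins with no calibration data fall back to their midpoint.
    return [sums[b] / counts[b] if counts[b] else (b + 0.5) / num_bins
            for b in range(num_bins)]


def apply_histogram_binning(bin_accuracy, confidence):
    """Map a raw confidence to the fitted accuracy of its bin."""
    num_bins = len(bin_accuracy)
    return bin_accuracy[min(int(confidence * num_bins), num_bins - 1)]


# Calibration data in which 95%-confident answers are right 1 time in 4.
table = fit_histogram_binning([0.95, 0.95, 0.95, 0.95], [1, 0, 0, 0])
calibrated = apply_histogram_binning(table, 0.95)
```

Once fitted, the table is frozen, so the mapping has no way to follow a change in the relationship between confidence and accuracy; this is the design-time limitation the comparison is meant to expose.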

### 5.4 Evaluation Metrics

Expected Calibration Error (ECE). The standard measure of calibration quality. We partition predictions into 10 equal-width bins by confidence and compute the weighted average of per-bin $|\text{accuracy}-\text{confidence}|$:

$$\text{ECE}=\sum_{b=1}^{10}\frac{n_b}{N}\,|\text{acc}_b-\text{conf}_b|. \tag{19}$$

Lower is better. ECE is a summary statistic over a binned reliability diagram rather than a strictly proper scoring rule in the sense of Gneiting and Raftery [[11](https://arxiv.org/html/2605.22949#bib.bib7)], but it remains the standard reporting measure in the calibration literature and is directly comparable across methods. We report ECE on phase 2 (post-shift) data for distribution shift experiments.
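Eq. (19) translates into a short function. The sketch returns ECE as a fraction (multiply by 100 for the percentage scale used in the results tables) and assumes that a confidence of exactly 1.0 falls into the top bin; the function name and the toy data are illustrative.

```python
def expected_calibration_error(confidences, outcomes, num_bins=10):
    """ECE over equal-width confidence bins, Eq. (19)."""
    n = len(confidences)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equally many confidences and outcomes")
    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    count = [0] * num_bins
    for c, o in zip(confidences, outcomes):
        b = min(int(c * num_bins), num_bins - 1)
        conf_sum[b] += c
        acc_sum[b] += o
        count[b] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b]:
            gap = abs(acc_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += count[b] / n * gap    # bin weight n_b / N
    return ece


# Two populated bins: confidence 0.95 with accuracy 0.50 (gap 0.45)
# and confidence 0.65 with accuracy 0.75 (gap 0.10), equally weighted.
confidences = [0.95] * 4 + [0.65] * 4
outcomes = [1, 1, 0, 0, 1, 1, 1, 0]
ece = expected_calibration_error(confidences, outcomes)
```

Empty bins contribute nothing, so the metric is driven by where the confidence mass actually lies.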

pass@1. For multi-agent selection: the fraction of problems where the selected answer is correct. The upper bound is the oracle (best possible selection with perfect knowledge), and the lower bound is random selection.

Pairwise resolution. Given two agents that disagree, the probability that the higher-confidence agent is correct. This isolates calibration quality on the disagreement cases where selection actually matters. Random baseline is 50%. Values below 50% indicate that confidence is anti-correlated with accuracy.
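One way to compute pairwise resolution is sketched below. Two interpretive choices are assumptions here rather than details stated in the text: a pair counts as a disagreement only when exactly one of the two agents is correct (which is what makes the random baseline 50%), and pairs with identical confidence are skipped. The function name and input layout are hypothetical.

```python
from itertools import combinations


def pairwise_resolution(tasks):
    """Fraction of disagreeing pairs in which the more confident agent is right.

    tasks: list of tasks, each a list of (confidence, is_correct) tuples
    with one entry per responding agent.
    """
    wins = 0
    total = 0
    for agents in tasks:
        for (c1, ok1), (c2, ok2) in combinations(agents, 2):
            # ASSUMPTION: skip pairs that agree in correctness or tie in confidence.
            if ok1 == ok2 or c1 == c2:
                continue
            total += 1
            wins += int(ok1 if c1 > c2 else ok2)
    return wins / total if total else float("nan")


tasks = [
    [(0.9, False), (0.6, True)],   # confident agent is wrong
    [(0.8, True), (0.5, False)],   # confident agent is right
    [(0.7, True), (0.7, True)],    # no disagreement, ignored
]
resolution = pairwise_resolution(tasks)
```

A pool in which the more confident agent is always the wrong one would score 0, the extreme form of the inversion that values below 50% indicate.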

Statistical methodology. For MARGIN, we report means and 95% bootstrap confidence intervals across 100 question-ordering shuffles (the EWMA is order-dependent, so this captures the sensitivity to presentation order). For baselines, the 100 shuffles correspond to different calibration/evaluation splits.

## 6 Distribution Shift Results

The distribution shift experiments test MARGIN's core advantage: online adaptation to changing task distributions without a held-out calibration set. Models learn calibration on an easy benchmark (phase 1), then face a harder benchmark (phase 2). We report phase 2 ECE, which measures calibration quality after the shift.

### 6.1 Code Generation

Table [4](https://arxiv.org/html/2605.22949#S6.T4) shows results across eight codegen shift conditions spanning three severity levels. Under severe shift (HumanEval or MBPP $\to$ BigCodeBench or CodeContests), design-time baselines remain catastrophically miscalibrated with ECE 37–63, while MARGIN adapts online to ECE 6–11. Under moderate shift ($\to$ LiveCodeBench), MARGIN approximately halves the best baseline's ECE (7.5 vs 13.9). Under mild shift (MBPP $\to$ MBPP+), MARGIN matches or beats the best baseline.

Table 4: Distribution shift results: code generation. Phase 2 ECE (lower is better). 10 cloud models for HE/MBPP/BCB shifts; 9 for CC/LCB shifts.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x2.png)

Figure 2: Per-model raw ECE on HumanEval (phase 1, mild regime) versus BigCodeBench (phase 2 target of the severe HE $\to$ BCB shift), across 21 foundation models. Every model is poorly calibrated on the out-of-distribution benchmark (ECE 40–80%), regardless of family or size. This motivates an online calibration layer that does not depend on any particular model's raw calibration quality.

### 6.2 Question Answering and Mathematics

MARGIN generalises beyond code generation. Table [5](https://arxiv.org/html/2605.22949#S6.T5) shows results on three QA and mathematics shift conditions. MARGIN achieves 3–4$\times$ lower ECE than the best static baseline on MMLU and MATH shifts. TriviaQA (temporal shift) is the one case where histogram binning is competitive (2.3 vs 3.9), likely because the shift is mild and the held-out calibration data remains informative.

Table 5: Distribution shift results: QA and mathematics. Phase 2 ECE (lower is better).

![Refer to caption](https://arxiv.org/html/2605.22949v1/x3.png)

Figure 3: Reliability diagrams on the MMLU (STEM $\to$ Humanities) shift. Left: raw verbalized confidence is systematically overconfident, with reliability curves lying far below the diagonal across all confidence bins (ECE $7.3\%\to 18.5\%$ under shift). Right: MARGIN-calibrated confidence tracks the diagonal closely in both phases, reducing ECE by $4\times$ post-shift ($2.7\%\to 4.6\%$).

### 6.3 Analysis

The pattern across all 11 shift conditions is consistent: MARGIN's advantage scales with shift severity (Figure [4](https://arxiv.org/html/2605.22949#S6.F4)). Under severe shift, where the gap between calibration-time and deployment-time distributions is largest, design-time methods have no mechanism to adapt and remain permanently miscalibrated. MARGIN's exponential forgetting (Proposition [1](https://arxiv.org/html/2605.22949#Thmtheorem1)) allows it to discount stale calibration data and track the new distribution.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x4.png)

Figure 4: Phase 2 ECE across all 11 distribution-shift conditions (8 code-generation + 3 QA/math), comparing Raw verbalized confidence, Temperature scaling, Platt scaling, Histogram binning, and MARGIN. Background shading indicates shift severity (severe/moderate/mild). MARGIN's relative advantage grows monotonically with shift severity: under severe shift, all design-time baselines remain catastrophically miscalibrated while MARGIN adapts online to single-digit ECE.

The recovery dynamics match the theoretical prediction of Proposition [3](https://arxiv.org/html/2605.22949#Thmtheorem3): after an abrupt shift, MARGIN's bias decays exponentially at rate $\alpha$, reaching practical calibration within approximately $1/\alpha\approx 25$ observations per band. The bias-variance tradeoff of Proposition [4](https://arxiv.org/html/2605.22949#Thmtheorem4) is visible in the mild shift regime, where MARGIN's estimation noise slightly exceeds the best static baseline's ECE. This is the expected cost of adaptability: when the environment happens to match the calibration set, a fixed correction is slightly more precise than an adaptive one. In practice, this case is rare.

## 7 Multi-Agent Selection Results

The second contribution of calibrated confidence is improved multi-agent selection: choosing which agent's answer to trust when multiple agents respond to the same task. Table [6](https://arxiv.org/html/2605.22949#S7.T6) summarises results across four code generation benchmarks spanning easy (HumanEval) to hard (CodeContests).

Table 6: Multi-agent selection summary (pass@1, %). Best model = always selecting the single best-performing model.

The pattern is striking: the harder the benchmark, the more MARGIN matters. On easy benchmarks where top models are near-perfect, calibration provides a modest improvement. On hard benchmarks where no single model dominates, raw confidence actively hurts selection, while MARGIN recovers and closes the gap to oracle.

### 7.1 The Confidence Inversion Problem

The most surprising finding is that raw verbalized confidence is *worse than random* at pairwise resolution on three of four benchmarks. Table [7](https://arxiv.org/html/2605.22949#S7.T7) shows the pairwise resolution rates.

Table 7: Pairwise resolution (%): probability of selecting the correct answer when two agents disagree. Random baseline is 50%.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x5.png)

Figure 5: Multi-agent selection results. Left: pass@1 (%) across four code-generation benchmarks, comparing random selection, the always-best single model, raw verbalized and consistency confidence, majority vote, MARGIN (verbalized and consistency), and an oracle upper bound. Right: pairwise resolution (%) when two agents disagree, with the 50% random baseline marked. Raw verbalized confidence falls below random on three of four hard benchmarks (confidence inversion); MARGIN restores pairwise resolution to 70–89%.

On LiveCodeBench (45.2%) and BigCodeBench (44.8%), selecting the more confident agent when two agents disagree is worse than a coin flip. This occurs because weaker models tend to express higher verbalized confidence on hard problems. The result is confidence inversion: the confidence signal is anti-correlated with accuracy, making confidence-weighted selection actively harmful.

MARGIN corrects this completely. After online calibration, pairwise resolution rises to 70–89% across all benchmarks. This is consistent with the theoretical prediction of Proposition [6](https://arxiv.org/html/2605.22949#Thmtheorem6): any calibration that restores the correlation between confidence and accuracy must improve selection above 50%.

### 7.2 Convergence Analysis

MARGIN's selection advantage emerges rapidly. Table [8](https://arxiv.org/html/2605.22949#S7.T8) shows pass@1 as a function of problems seen, using verbalized confidence.

Table 8: Convergence of MARGIN selection (verbalized pass@1, %) by number of problems seen.

MARGIN surpasses raw confidence within 10–30 problems across all benchmarks. On CodeContests, MARGIN already matches the always-best model (40.0%) by problem 50. LiveCodeBench, with 880 problems, provides the most robust convergence test: MARGIN reaches near-peak performance by approximately 50 problems and sustains it stably across the remaining 830. This convergence speed is consistent with the effective sample size of approximately $1/\alpha=25$ observations per band predicted by Proposition [1](https://arxiv.org/html/2605.22949#Thmtheorem1).

## 8 Cross-Task Calibration Transfer

Can calibration factors learned on one benchmark transfer to a different benchmark? Table [9](https://arxiv.org/html/2605.22949#S8.T9) evaluates eight transfer directions between codegen benchmarks, comparing raw (uncalibrated), transferred MARGIN factors (frozen from the source benchmark, no further updates), and from-scratch MARGIN (learning directly on the target).

Table 9: Cross-task calibration transfer (mean ECE across 9–10 cloud models). Lower is better.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x6.png)

Figure 6: Cross-task calibration transfer (mean across 9–10 cloud models). Left: phase 2 ECE under three regimes (Raw, Transferred factors frozen from the source benchmark, and From-scratch online learning on the target). Right: transfer penalty, the ECE gap between Transferred and From-scratch. Transferred factors always beat raw, but online adaptation on the target is 3–4$\times$ better, illustrating that calibration is distribution-specific rather than model-specific.

Transferred factors always improve over raw (e.g., 66.5 $\to$ 57.6 on HE $\to$ BCB), illustrating that MARGIN learns something generalisable about each model's calibration tendencies. However, from-scratch adaptation that learns directly on the target distribution is 3–4$\times$ better (57.6 $\to$ 14.4). This illustrates the central argument: online adaptation to the specific deployment distribution is fundamentally superior to any pre-computed correction, even one derived by MARGIN itself on a related task.

This result connects directly to Proposition [2](https://arxiv.org/html/2605.22949#Thmtheorem2): the EWMA converges to the *distribution-specific* accuracy rate $\theta$, not a generic correction factor. When the distribution changes, the optimal factor changes, and only online learning can track it.

## 9Robustness to Dynamic Agent Pools

Deployed multi\-agent systems rarely operate on a fixed roster\. Agents are added as new foundation models become available, removed for cost or quality reasons, or cycled in and out as the coordinator rebalances capacity\. A calibration method that only works on a static pool is of limited practical use\. This section evaluates MARGIN under three pool\-dynamic scenarios, using the 11 cloud models with full coverage across the three QA and math shift datasets \(MMLU\-shift, TriviaQA, MATH\-shift\), 50 shuffles per scenario, 1000 observations per run\. Figure[7](https://arxiv.org/html/2605.22949#S9.F7)summarises the results\.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x7.png)

Figure 7: Robustness of MARGIN to dynamic agent pools. 11 cloud models with full QA and math shift coverage, 50 shuffles per scenario, 1000 observations each. (a) Agent dropout: the two most-observed models are removed at observation 500 (dashed line); ensemble ECE (50-observation window) remains stable, with post-drop mean 9.66% versus pre-drop 11.86%. (b) Cold start: 4 established models run for 500 observations, then 7 newcomers join; newcomer cumulative ECE under hierarchical shrinkage ($k_s=18$) tracks the established-model reference from the first 50-observation checkpoint, while raw newcomer ECE requires over 200 observations to converge. (c) Rolling replacement: the worst-calibrated active model is swapped every 200 observations (dashed lines); ECE is stable across all four swaps. Shaded bands are $\pm 1$ standard deviation across 50 shuffles.

Scenario 1, agent dropout. At observation 500, the two models with the most phase-1 observations are removed. MARGIN’s remaining calibrators continue to operate unchanged. Ensemble ECE descends from roughly 18% at initialisation to around 10% by observation 200, independent of the drop event. Post-drop mean ECE is 9.66% (std 3.55) compared with 11.86% pre-drop (std 5.03). The band-level statistics for surviving models are unaffected by the removal, so there is no spike at the drop event in Figure [7](https://arxiv.org/html/2605.22949#S9.F7)(a).

Scenario 2, cold start. Phase 1 runs for 500 observations with 4 established models. At observation 500, 7 newcomer models join with no prior calibration history. We compare two strategies. The first blends each newcomer’s band factor toward the current pool-wide band average using hierarchical shrinkage with $k_s=18$ (Section [3.4](https://arxiv.org/html/2605.22949#S3.SS4)). The second uses raw per-agent EWMA without shrinkage. Figure [7](https://arxiv.org/html/2605.22949#S9.F7)(b) plots newcomer cumulative ECE as a function of newcomer observations. With shrinkage, newcomers reach 7.74% ECE after 50 observations and 5.12% after 200, closely tracking the established-model reference at 7.10% mean. Without shrinkage, newcomers start at 13.67% ECE and require more than 200 observations to reach single-digit error. Shrinkage reduces cold-start ECE by 41–44% across the first four checkpoints.
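The blending rule itself is defined in Section 3.4 and is not reproduced in this section. As a rough illustration of why shrinkage helps at cold start, the sketch below assumes the common empirical-Bayes form in which an agent's own band factor receives weight $n/(n+k_s)$ and the pool-wide band average receives the remainder. The function name `blended_factor`, the example values, and this exact weighting are assumptions made for illustration, not the paper's definition.

```python
def blended_factor(agent_factor: float, pool_factor: float,
                   n_obs: int, k_s: float = 18.0) -> float:
    """Shrink a per-agent band factor toward the pool-wide band average.

    Assumed form (hypothetical, not taken from Section 3.4):
        w = n_obs / (n_obs + k_s)
        blended = w * agent_factor + (1 - w) * pool_factor
    """
    if n_obs < 0 or k_s < 0:
        raise ValueError("n_obs and k_s must be non-negative")
    if n_obs + k_s == 0:
        # No observations and no shrinkage: fall back to the agent's own value.
        return agent_factor
    w = n_obs / (n_obs + k_s)
    return w * agent_factor + (1.0 - w) * pool_factor


# A newcomer with a noisy early estimate is pulled toward the pool average
# and gradually released as its own observations accumulate.
for n in (0, 18, 50, 200):
    print(n, round(blended_factor(agent_factor=0.60, pool_factor=0.80, n_obs=n), 3))
```

Under this assumed form, $k_s$ acts as a pseudo-count: at $n=k_s$ the agent's own estimate and the pool average contribute equally.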

Scenario 3, rolling replacement. Every 200 observations, the worst-calibrated active model (by recent 200-observation ECE) is swapped out, and a previously removed model returns. After the initial cold-start segment (observations 50–200, mean ECE 14.45%), ensemble ECE is stable across all four swap points: 10.34%, 10.45%, 10.36%, and 10.30% for segments 250–400, 450–600, 650–800, and 850–1000 respectively. The standard deviation across shuffles is 3.6–4.2% in all post-warmup segments.
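The swap rule above can be sketched as a windowed ECE computation followed by an argmax over active models. This is a minimal illustration using the standard binned ECE definition; the 10-bin default, the `(confidence, correct)` data layout, and the function names are assumptions, since this section does not state them.

```python
from typing import Dict, List, Sequence, Tuple

# (calibrated confidence in [0, 1], was the answer correct?)
Observation = Tuple[float, bool]


def expected_calibration_error(obs: Sequence[Observation], n_bins: int = 10) -> float:
    """Standard binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if not obs:
        return 0.0
    conf_sum = [0.0] * n_bins
    correct_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, correct in obs:
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        conf_sum[b] += conf
        correct_sum[b] += int(correct)
        count[b] += 1
    total = len(obs)
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(correct_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / total) * gap
    return ece


def worst_calibrated(history: Dict[str, List[Observation]], window: int = 200) -> str:
    """Return the active model whose most recent `window` observations have the highest ECE."""
    return max(history, key=lambda m: expected_calibration_error(history[m][-window:]))
```

For example, given a `history` dictionary keyed by model name, `worst_calibrated(history)` names the model that the rolling-replacement scenario would swap out next.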

The three scenarios cover the dominant failure modes for a calibration layer under pool change: catastrophic loss of calibrated statistics when an agent leaves (scenario 1), slow convergence for new entrants (scenario 2), and accumulation of drift over repeated composition changes (scenario 3). MARGIN’s band-level, per-agent statistics localise the effect of each change. Hierarchical shrinkage addresses the cold-start gap with a single hyperparameter. No additional mechanism is required for the rolling case, as the EWMA’s forgetting rate naturally discounts the contribution of removed agents’ prior updates.

## 10 Ablation Studies

We ablate each of MARGIN’s three hyperparameters ($\alpha$, band count, $k_s$) and the symmetric-vs-asymmetric design choice. All ablations use three representative shift conditions: HE→BCB (severe), MBPP→CC (moderate-severe), and MBPP→MBPP+ (mild).

### 10.1 EWMA Learning Rate

Table [10](https://arxiv.org/html/2605.22949#S10.T10) shows ECE across seven $\alpha$ values. The U-shape predicted by Proposition [4](https://arxiv.org/html/2605.22949#Thmtheorem4) is clearly visible: too-low $\alpha$ (slow adaptation) yields high ECE under severe shift, while too-high $\alpha$ (noisy estimates) degrades mild shift. The default $\alpha=0.04$ provides the best balance. Note that $\alpha=0.08$ is slightly better on severe shift but $1.4\times$ worse on mild shift.

Table 10: Ablation: EWMA learning rate $\alpha$ (phase 2 ECE).

### 10.2 Confidence Band Count

Table [11](https://arxiv.org/html/2605.22949#S10.T11) varies the number of equal-width confidence bands from 1 (a single calibration factor per model) to 10. Fewer bands are better under severe shift, where each band must re-learn its factor from limited data. More bands provide slightly finer correction under mild shift. The sweet spot is 1–3 bands with the data volumes in our experiments. Over longer horizons with more observations per band, finer partitions would likely outperform.

Table 11: Ablation: number of confidence bands (phase 2 ECE).

### 10.3 Asymmetric Learning Rate

Table [12](https://arxiv.org/html/2605.22949#S10.T12) compares the default symmetric update ($\alpha=0.04$ for both correct and incorrect outcomes) against five asymmetric configurations. In each, $\alpha_c$ is the learning rate after correct predictions and $\alpha_i$ after incorrect ones.

Symmetric $\alpha$ wins across the board. Configurations that penalise overconfidence faster (high $\alpha_i$) help slightly on severe shift but catastrophically hurt mild shift (MBPP→MBPP+ ECE rises from 4.5 to 30.3 with $\alpha_c=0.02$, $\alpha_i=0.08$). This is precisely the bias predicted by Proposition [5](https://arxiv.org/html/2605.22949#Thmtheorem5): asymmetric rates introduce systematic underconfidence, and the resulting bias compounds across all agents and bands.

Table 12: Ablation: asymmetric learning rate (phase 2 ECE). $\alpha_c$ = correct, $\alpha_i$ = incorrect.

### 10.4 Bayesian Shrinkage

Table [13](https://arxiv.org/html/2605.22949#S10.T13) varies the shrinkage constant $k_s$ from 0 (no blending, pure band-level factors) to 1000. Blending consistently helps under moderate-to-severe shift, the regime most relevant to real-world deployment. The MBPP→CC condition (closest to production-scale drift) plateaus at $k_s\approx 100$ with a 40% improvement over no blending (ECE 3.8 vs 6.3). Mild shift is neutral up to $k_s=100$, then very slightly degrades, an acceptable tradeoff.

Table 13: Ablation: Bayesian shrinkage constant $k_s$ (phase 2 ECE).

Recommended configuration. Based on these ablations, we recommend $\alpha=0.04$, $K=3$ bands, and $k_s=100$ as production defaults. The no-blending case ($k_s=0$) already beats all design-time baselines; shrinkage provides further improvement in the moderate-to-severe shift regime that dominates practical deployment. Figure [8](https://arxiv.org/html/2605.22949#S10.F8) summarises all four ablations across the three representative shift conditions.

![Refer to caption](https://arxiv.org/html/2605.22949v1/x8.png)

Figure 8: Ablations across three representative shift conditions (HE→BCB severe, MBPP→CC moderate, MBPP→MBPP+ mild). (a) EWMA learning rate $\alpha$: the bias-variance U-shape of Proposition [4](https://arxiv.org/html/2605.22949#Thmtheorem4) is visible, with $\alpha=0.04$ balancing severe and mild regimes. (b) Confidence band count $K$: fewer bands suffice at current data volumes; finer partitions would benefit from more per-band observations. (c) Asymmetric learning rate: all asymmetric configurations are worse than symmetric, with mild-shift ECE rising sharply when $\alpha_i>\alpha_c$. (d) Bayesian shrinkage constant $k_s$: blending plateaus near $k_s=100$ on moderate shift, with mild shift essentially neutral.

## 11 Discussion

Why symmetric works. The most counterintuitive finding is that symmetric EWMA outperforms all asymmetric configurations. In strategic settings (game theory, mechanism design), asymmetric penalties are standard: penalising defection faster than rewarding cooperation prevents exploitation. Foundation model agents, however, are not strategic. They have fixed policies and do not adjust their confidence in response to calibration feedback. Their errors are epistemic, arising from the gap between training distribution and deployment distribution. Proposition [5](https://arxiv.org/html/2605.22949#Thmtheorem5) shows that under these conditions, asymmetric updates introduce systematic bias, and the ablation in Section [10.3](https://arxiv.org/html/2605.22949#S10.SS3) bears this out.

This distinction between strategic and non\-strategic agents may become less clear as foundation models gain the ability to observe and respond to their own reputation signals\. The symmetric EWMA is the correct choice for current deployment, but future systems with truly adaptive agents may require the asymmetric approach\.

The confidence inversion problem. Perhaps the most practically important finding is that raw verbalized confidence is worse than random on hard benchmarks (pairwise resolution 44.8–55.5%). This means that naively trusting model confidence in multi-agent systems actively degrades performance. The mechanism is straightforward: weaker models tend to express higher confidence on problems they cannot solve. This is consistent with the Dunning–Kruger pattern observed in human metacognition, though the underlying cause is different (training distribution mismatch rather than cognitive bias). MARGIN’s online calibration learns this pattern and corrects it, restoring the correlation between confidence and accuracy.

Practical deployment. MARGIN requires approximately 30–50 observations per agent to reach practical calibration, consistent with the effective sample size of $\sim 1/\alpha=25$ per band from Proposition [1](https://arxiv.org/html/2605.22949#Thmtheorem1). Hierarchical shrinkage shortens the cold-start period substantially. Section [9](https://arxiv.org/html/2605.22949#S9) shows that with $k_s=18$, newcomer agents reach 7.74% ECE after 50 observations and track the established-model reference within the first checkpoint, while raw per-agent EWMA requires more than 200 observations to reach comparable error. The same section demonstrates stability under mid-stream agent removal and repeated composition change. Computational overhead is negligible: one EWMA update per observation per (agent, band) pair.
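To make the overhead claim concrete, the following minimal sketch keeps one running estimate per (agent, band) key and applies a single symmetric EWMA step per observation. The class name `BandEwma`, the initial value of 0.5, and the dictionary layout are assumptions for illustration; the sketch omits shrinkage and the mapping from band statistics to calibrated confidence, which are defined earlier in the paper.

```python
from collections import defaultdict


class BandEwma:
    """Track an EWMA of observed accuracy per (agent, confidence band)."""

    def __init__(self, alpha: float = 0.04, n_bands: int = 3, init: float = 0.5):
        self.alpha = alpha
        self.n_bands = n_bands
        # init=0.5 is a placeholder prior, not a value taken from the paper.
        self.estimates = defaultdict(lambda: init)

    def band(self, confidence: float) -> int:
        """Equal-width band index for a confidence in [0, 1]."""
        return min(int(confidence * self.n_bands), self.n_bands - 1)

    def update(self, agent: str, confidence: float, correct: bool) -> float:
        """Apply one symmetric EWMA step: a <- (1 - alpha) * a + alpha * x."""
        key = (agent, self.band(confidence))
        x = 1.0 if correct else 0.0
        self.estimates[key] = (1.0 - self.alpha) * self.estimates[key] + self.alpha * x
        return self.estimates[key]


tracker = BandEwma()
print(tracker.update("agent_a", confidence=0.95, correct=True))
```

Each update touches one dictionary entry and performs a constant number of arithmetic operations, so the cost per observation does not grow with the length of the task stream.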

Limitations. Any calibration method that maps confidence to accuracy must observe accuracy at some point. The question is whether that observation happens once on a held-out set, or continuously from the deployment stream. MARGIN occupies the latter regime. MARGIN requires binary outcome signals (correct/incorrect), limiting its applicability to tasks with verifiable answers. Extension to graded or partial-credit outcomes is straightforward (replace the binary update with a continuous outcome) but is not evaluated here. The number of confidence bands is a design choice with no principled selection criterion beyond the empirical ablation. The i.i.d. assumption within each band is violated when consecutive tasks are correlated (e.g., topic clusters), though the shuffle robustness analysis (standard deviation 0.13 across 100 orderings) suggests this violation has limited practical impact.

## 12 Conclusion

We presented MARGIN, an online confidence calibration method for multi-agent foundation model systems. MARGIN uses per-band exponentially weighted moving averages to learn calibration factors from the task stream itself, requiring no held-out calibration data, no model access, and no retraining. The method has three hyperparameters with robust defaults ($\alpha=0.04$, $K=3$ bands, $k_s=100$) and negligible computational overhead.

Across 19 models, 8 benchmarks, and over 50,000 observations, MARGIN achieves $3$–$6\times$ lower calibration error than the best design-time baselines under distribution shift. In multi-agent selection, MARGIN corrects the confidence inversion problem, where raw confidence is anti-correlated with accuracy on hard benchmarks, raising pairwise resolution from 45–56% (worse than random) to 70–89%. MARGIN-calibrated selection surpasses the always-best-model baseline on three of four benchmarks.

Six formal propositions characterise MARGIN’s convergence, tracking speed, bias\-variance tradeoff, and the optimality of symmetric updates for non\-strategic agents\. The theoretical predictions are illustrated by the empirical results throughout\.

The key finding is that foundation model confidence, as currently expressed, is not merely imprecise but actively misleading in multi\-agent coordination\. Runtime calibration is not an enhancement; it is a prerequisite for reliable multi\-agent systems\.

## Appendix A Proofs of Formal Properties

### Proof of Proposition [1](https://arxiv.org/html/2605.22949#Thmtheorem1) (EWMA as Exponential Discounting)

We proceed by induction. The base case $t=1$ gives $\hat{a}_1=(1-\alpha)\hat{a}_0+\alpha X_1$, matching Eq. ([10](https://arxiv.org/html/2605.22949#S4.E10)). For the inductive step, assume the expansion holds for $t-1$:

$$
\begin{aligned}
\hat{a}_t &= (1-\alpha)\hat{a}_{t-1}+\alpha X_t \\
&= (1-\alpha)\left[(1-\alpha)^{t-1}\hat{a}_0+\alpha\sum_{\tau=1}^{t-1}(1-\alpha)^{t-1-\tau}X_\tau\right]+\alpha X_t \\
&= (1-\alpha)^{t}\hat{a}_0+\alpha\sum_{\tau=1}^{t-1}(1-\alpha)^{t-\tau}X_\tau+\alpha X_t \\
&= (1-\alpha)^{t}\hat{a}_0+\alpha\sum_{\tau=1}^{t}(1-\alpha)^{t-\tau}X_\tau.
\end{aligned}
$$

The weight on observation $X_\tau$ is $w_\tau=\alpha(1-\alpha)^{t-\tau}$. The sum of weights is

$$
\sum_{\tau=1}^{t}\alpha(1-\alpha)^{t-\tau}=\alpha\cdot\frac{1-(1-\alpha)^{t}}{1-(1-\alpha)}=1-(1-\alpha)^{t},
$$

with the complement $(1-\alpha)^{t}$ carried by the initialisation $\hat{a}_0$. Total weight is exactly 1.

For the effective window: the ratio of the weight on an observation $n$ steps in the past to the most recent observation is $(1-\alpha)^{n}$. Setting this equal to $e^{-1}$ and solving: $n=-1/\ln(1-\alpha)\approx 1/\alpha$ for small $\alpha$. At $\alpha=0.04$, this gives an effective window of $\approx 25$ observations. ∎
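As a quick numerical check of the expansion, the sketch below compares the recursion against the closed form on an arbitrary outcome sequence and evaluates the effective window. The function names and the outcome sequence are illustrative choices, not part of the paper.

```python
import math


def ewma_recursive(xs, alpha, a0):
    """Run the EWMA recursion a_t = (1 - alpha) * a_{t-1} + alpha * x_t."""
    a = a0
    for x in xs:
        a = (1.0 - alpha) * a + alpha * x
    return a


def ewma_expanded(xs, alpha, a0):
    """Evaluate the closed-form expansion from the proof."""
    t = len(xs)
    weighted = sum(alpha * (1.0 - alpha) ** (t - tau) * x
                   for tau, x in enumerate(xs, start=1))
    return (1.0 - alpha) ** t * a0 + weighted


def effective_window(alpha):
    """Number of steps after which the relative weight has decayed to 1/e."""
    return -1.0 / math.log(1.0 - alpha)


outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # arbitrary illustrative outcomes
print(ewma_recursive(outcomes, 0.04, 0.5), ewma_expanded(outcomes, 0.04, 0.5))
print(effective_window(0.04))  # roughly 24.5, which the text rounds to 25
```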

### Proof of Proposition [2](https://arxiv.org/html/2605.22949#Thmtheorem2) (Calibration Convergence)

Mean. Taking expectations in Eq. ([10](https://arxiv.org/html/2605.22949#S4.E10)) and using $\mathbb{E}[X_\tau]=\theta$:

$$
\begin{aligned}
\mathbb{E}[\hat{a}_t] &= (1-\alpha)^{t}\hat{a}_0+\alpha\theta\sum_{\tau=1}^{t}(1-\alpha)^{t-\tau} \\
&= (1-\alpha)^{t}\hat{a}_0+\theta\,[1-(1-\alpha)^{t}] \\
&= \theta+(1-\alpha)^{t}(\hat{a}_0-\theta).
\end{aligned}
$$

As $t\to\infty$, $(1-\alpha)^{t}\to 0$, so $\mathbb{E}[\hat{a}_t]\to\theta$.

Variance. By independence of the $X_\tau$:

$$
\begin{aligned}
\mathrm{Var}(\hat{a}_t) &= \sum_{\tau=1}^{t}w_\tau^{2}\,\mathrm{Var}(X_\tau)=\theta(1-\theta)\,\alpha^{2}\sum_{\tau=1}^{t}(1-\alpha)^{2(t-\tau)} \\
&= \theta(1-\theta)\,\alpha^{2}\cdot\frac{1-(1-\alpha)^{2t}}{1-(1-\alpha)^{2}} \\
&= \frac{\alpha}{2-\alpha}\,\theta(1-\theta)\,\bigl[1-(1-\alpha)^{2t}\bigr].
\end{aligned}
$$

As $t\to\infty$: $\mathrm{Var}(\hat{a}_t)\to\frac{\alpha}{2-\alpha}\,\theta(1-\theta)$.

Convergence in probability. The mean squared error decomposes as

$$
\mathbb{E}[(\hat{a}_t-\theta)^{2}]=\mathrm{Var}(\hat{a}_t)+\bigl[(1-\alpha)^{t}(\hat{a}_0-\theta)\bigr]^{2}.
$$

Both terms converge: the variance to its steady-state value, the squared bias to zero. By Chebyshev’s inequality, for any $\varepsilon>0$:

$$
\mathbb{P}(|\hat{a}_t-\theta|>\varepsilon)\leq\frac{\mathbb{E}[(\hat{a}_t-\theta)^{2}]}{\varepsilon^{2}}\to\frac{\alpha\,\theta(1-\theta)}{(2-\alpha)\,\varepsilon^{2}}.
$$

This is the irreducible error. Setting $\varepsilon=3\,\mathrm{Std}_{\text{ss}}$ gives probability $\leq 1/9$. ∎
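The closed forms for the mean and variance can be checked without simulation by propagating both moments one step at a time. The sketch below does this under the proof's assumption of i.i.d. Bernoulli outcomes; the values of $\theta$ and $\hat{a}_0$ are illustrative, and the function names are not from the paper.

```python
def exact_moments(theta, alpha, a0, t):
    """Propagate the mean and variance of the EWMA one step at a time.

    With X_t ~ Bernoulli(theta) independent of the past:
        mean_t = (1 - alpha) * mean_{t-1} + alpha * theta
        var_t  = (1 - alpha)**2 * var_{t-1} + alpha**2 * theta * (1 - theta)
    """
    mean, var = a0, 0.0
    for _ in range(t):
        mean = (1.0 - alpha) * mean + alpha * theta
        var = (1.0 - alpha) ** 2 * var + alpha ** 2 * theta * (1.0 - theta)
    return mean, var


def closed_form_moments(theta, alpha, a0, t):
    """Closed forms derived in the proof."""
    decay = (1.0 - alpha) ** t
    mean = theta + decay * (a0 - theta)
    var = alpha / (2.0 - alpha) * theta * (1.0 - theta) * (1.0 - decay ** 2)
    return mean, var


theta, alpha, a0 = 0.8, 0.04, 0.5  # theta and a0 are illustrative values
for t in (10, 50, 500):
    print(t, exact_moments(theta, alpha, a0, t), closed_form_moments(theta, alpha, a0, t))

steady_state_std = (alpha / (2.0 - alpha) * theta * (1.0 - theta)) ** 0.5
print(steady_state_std)  # about 0.057 for these values
```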

### Proof of Proposition [3](https://arxiv.org/html/2605.22949#Thmtheorem3) (Tracking Speed)

After the shift at $t_0$, outcomes are i.i.d. $\mathrm{Bernoulli}(\theta')$. The estimator at $t_0+n$ is a fresh EWMA initialised at $\hat{a}_{t_0}$ tracking $\theta'$. By Proposition [2](https://arxiv.org/html/2605.22949#Thmtheorem2):

$$
\mathbb{E}[\hat{a}_{t_0+n}]=\theta'+(1-\alpha)^{n}\,(\hat{a}_{t_0}-\theta').
$$

The bias magnitude is $(1-\alpha)^{n}\,|\hat{a}_{t_0}-\theta'|$. If the pre-shift estimator had converged, $\hat{a}_{t_0}\approx\theta$, so $|\hat{a}_{t_0}-\theta'|\approx|\Delta|$.

Setting $(1-\alpha)^{n}|\Delta|\leq\varepsilon$ and using $(1-\alpha)^{n}\leq e^{-\alpha n}$:

$$
e^{-\alpha n}\,|\Delta|\leq\varepsilon\implies n\geq\frac{1}{\alpha}\ln\frac{|\Delta|}{\varepsilon}.
$$

At $\alpha=0.04$: recovering to within $\varepsilon=0.01$ after a shift of $\Delta=0.20$ requires $n\geq 25\ln(20)\approx 75$ observations. This is consistent with the empirical observation that MARGIN recovers within approximately 50–100 observations after a severe distribution shift (Section [6.3](https://arxiv.org/html/2605.22949#S6.SS3)). ∎
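The worked example can be reproduced directly. The sketch below evaluates the sufficient condition from the proof and, for comparison, the smallest $n$ that satisfies the exact inequality $(1-\alpha)^{n}|\Delta|\leq\varepsilon$; the function names are illustrative.

```python
import math


def recovery_bound(delta, eps, alpha):
    """Sufficient number of observations from the proof: n >= (1/alpha) * ln(|delta| / eps)."""
    return math.ceil(math.log(abs(delta) / eps) / alpha)


def recovery_exact(delta, eps, alpha):
    """Smallest n with (1 - alpha)**n * |delta| <= eps, found by direct iteration."""
    n, bias = 0, abs(delta)
    while bias > eps:
        bias *= 1.0 - alpha
        n += 1
    return n


print(recovery_bound(0.20, 0.01, 0.04))  # 75, matching the worked example
print(recovery_exact(0.20, 0.01, 0.04))  # 74, slightly below the bound
```

The exponential bound is slightly conservative because $e^{-\alpha n}$ overestimates $(1-\alpha)^{n}$, but the gap is a single observation at the default learning rate.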

### Proof of Proposition [4](https://arxiv.org/html/2605.22949#Thmtheorem4) (Bias-Variance Tradeoff)

Consider an environment where $\theta_t=\theta_0+\delta\,t$ drifts linearly. At any time $t$, the expected estimator value lags behind the true value. The EWMA is a weighted average of past observations, with effective centre of mass approximately $1/(2\alpha)$ steps in the past. The expected tracking lag is therefore approximately $\delta/(2\alpha)$.

From Proposition [2](https://arxiv.org/html/2605.22949#Thmtheorem2), the steady-state standard deviation is $\sqrt{\alpha\,\theta(1-\theta)/(2-\alpha)}$.

By the triangle inequality, the expected absolute error is bounded by the sum:

$$
\mathbb{E}[|\hat{a}_t-\theta_t|]\leq\frac{\delta}{2\alpha}+\sqrt{\frac{\alpha\,\theta(1-\theta)}{2-\alpha}}.
$$

The first term is decreasing in $\alpha$; the second is increasing. Differentiating and setting to zero (approximating $2-\alpha\approx 2$ for small $\alpha$):

$$
\frac{\delta}{2\alpha^{2}}=\frac{1}{2}\sqrt{\frac{\theta(1-\theta)}{2\alpha}}\implies\alpha^{*}=O\!\left(\delta^{2/3}\right).
$$

In practice, the optimum is broad: the ablation in Section [10.1](https://arxiv.org/html/2605.22949#S10.SS1) shows that $\alpha\in[0.02,0.08]$ gives similar performance across moderate shift severities, with $\alpha=0.04$ as the robust default. ∎
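The shape of the bound can be inspected numerically. The sketch below evaluates the right-hand side of the inequality over a grid of $\alpha$ values and solves the stationarity condition explicitly, which gives $\alpha^{*}=\bigl(\delta\sqrt{2}/\sqrt{\theta(1-\theta)}\bigr)^{2/3}$ under the same small-$\alpha$ approximation. The drift rate $\delta=0.001$ and $\theta=0.8$ are hypothetical values chosen for illustration, not quantities reported in the paper.

```python
def error_bound(alpha, delta, theta):
    """Upper bound from the proof: tracking lag plus steady-state standard deviation."""
    lag = delta / (2.0 * alpha)
    std = (alpha * theta * (1.0 - theta) / (2.0 - alpha)) ** 0.5
    return lag + std


def alpha_star(delta, theta):
    """Stationary point of the bound under the small-alpha approximation 2 - alpha ~ 2.

    Solving delta / (2 a^2) = 0.5 * sqrt(theta (1 - theta) / (2 a)) for a gives
        a = (delta * sqrt(2) / sqrt(theta (1 - theta))) ** (2/3).
    """
    return (delta * 2.0 ** 0.5 / (theta * (1.0 - theta)) ** 0.5) ** (2.0 / 3.0)


delta, theta = 0.001, 0.8  # hypothetical drift per observation and band accuracy
for a in (0.005, 0.01, 0.02, 0.04, 0.08, 0.16):
    print(a, round(error_bound(a, delta, theta), 4))
print("approximate optimum:", alpha_star(delta, theta))
```

Printing the grid shows the U-shape: the lag term dominates at small $\alpha$ and the noise term dominates at large $\alpha$.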

### Proof of Proposition[5](https://arxiv.org/html/2605.22949#Thmtheorem5)\(Symmetric Optimality\)

Consider the asymmetric EWMA:

a^t=\{\(1−αup\)​a^t−1\+αupif​Xt=1,\(1−αdown\)​a^t−1if​Xt=0\.\\hat\{a\}\_\{t\}=\\begin\{cases\}\(1\-\\alpha\_\{\\mathrm\{up\}\}\)\\hat\{a\}\_\{t\-1\}\+\\alpha\_\{\\mathrm\{up\}\}&\\text\{if \}X\_\{t\}=1,\\\\ \(1\-\\alpha\_\{\\mathrm\{down\}\}\)\\hat\{a\}\_\{t\-1\}&\\text\{if \}X\_\{t\}=0\.\\end\{cases\}Taking expectations at the stationary pointa^∞\\hat\{a\}\_\{\\infty\}:

𝔼​\[a^∞\]\\displaystyle\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]=θ​\[\(1−αup\)​𝔼​\[a^∞\]\+αup\]\+\(1−θ\)​\(1−αdown\)​𝔼​\[a^∞\]\\displaystyle=\\theta\\,\\bigl\[\(1\-\\alpha\_\{\\mathrm\{up\}\}\)\\,\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]\+\\alpha\_\{\\mathrm\{up\}\}\\bigr\]\+\(1\-\\theta\)\\,\(1\-\\alpha\_\{\\mathrm\{down\}\}\)\\,\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]=𝔼​\[a^∞\]​\[1−αup​θ−αdown​\(1−θ\)\]\+αup​θ\.\\displaystyle=\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]\\,\\bigl\[1\-\\alpha\_\{\\mathrm\{up\}\}\\theta\-\\alpha\_\{\\mathrm\{down\}\}\(1\-\\theta\)\\bigr\]\+\\alpha\_\{\\mathrm\{up\}\}\\theta\.Solving:

𝔼​\[a^∞\]​\[αup​θ\+αdown​\(1−θ\)\]=αup​θ⟹𝔼​\[a^∞\]=αup​θαup​θ\+αdown​\(1−θ\)\.\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]\\,\\bigl\[\\alpha\_\{\\mathrm\{up\}\}\\theta\+\\alpha\_\{\\mathrm\{down\}\}\(1\-\\theta\)\\bigr\]=\\alpha\_\{\\mathrm\{up\}\}\\theta\\implies\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]=\\frac\{\\alpha\_\{\\mathrm\{up\}\}\\,\\theta\}\{\\alpha\_\{\\mathrm\{up\}\}\\,\\theta\+\\alpha\_\{\\mathrm\{down\}\}\\,\(1\-\\theta\)\}\.Setting𝔼​\[a^∞\]=θ\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]=\\thetarequiresαup​θ​\(1−θ\)=αdown​θ​\(1−θ\)\\alpha\_\{\\mathrm\{up\}\}\\theta\(1\-\\theta\)=\\alpha\_\{\\mathrm\{down\}\}\\theta\(1\-\\theta\), i\.e\.,αup=αdown\\alpha\_\{\\mathrm\{up\}\}=\\alpha\_\{\\mathrm\{down\}\}\.

Whenαdown\>αup\\alpha\_\{\\mathrm\{down\}\}\>\\alpha\_\{\\mathrm\{up\}\}\(penalising errors faster\), the estimator is biased downward\. Forθ=0\.8\\theta=0\.8,αup=0\.02\\alpha\_\{\\mathrm\{up\}\}=0\.02,αdown=0\.06\\alpha\_\{\\mathrm\{down\}\}=0\.06:

𝔼​\[a^∞\]=0\.0160\.016\+0\.012=0\.0160\.028≈0\.571\.\\mathbb\{E\}\[\\hat\{a\}\_\{\\infty\}\]=\\frac\{0\.016\}\{0\.016\+0\.012\}=\\frac\{0\.016\}\{0\.028\}\\approx 0\.571\.The estimator converges to 0\.571 instead of the true 0\.80, a bias of−0\.229\-0\.229\. This systematic underconfidence compounds across all agents and bands, predicting the 3–4×\\timesECE degradation observed empirically\.

For the optimality claim: among EWMA estimators with potentially different up/down rates, only the symmetric case $\alpha_{\mathrm{up}}=\alpha_{\mathrm{down}}$ is unbiased under i.i.d. Bernoulli outcomes. Since the variance of the symmetric estimator matches the Cramér–Rao-type lower bound for exponentially weighted estimators of bounded random variables, it is the minimum-variance unbiased estimator within this family. ∎

### Proof of Proposition [6](https://arxiv.org/html/2605.22949#Thmtheorem6) (Selection Monotonicity)

Consider $N$ agents with true accuracies $p_{1}>p_{2}\geq\cdots\geq p_{N}$ responding to a single task. Each agent's calibrated confidence is $\tilde{c}_{i}=p_{i}+\eta_{i}$, where $\eta_{i}$ are independent zero-mean noise terms with common variance $\sigma^{2}$ (the residual calibration error).

The best agent is selected when $\tilde{c}_{1}>\tilde{c}_{j}$ for all $j\neq 1$, i.e., when $\eta_{1}-\eta_{j}>-(p_{1}-p_{j})$ for all $j$. As $\sigma^{2}\to 0$, the noise vanishes and the best agent is selected with probability 1. As $\sigma^{2}\to\infty$, selection becomes uniform random with probability $1/N$.

For intermediate $\sigma^{2}$, the probability of correct selection is:

$$\mathbb{P}[\tilde{c}_{1}=\max_{j}\tilde{c}_{j}]=\mathbb{E}\!\left[\prod_{j=2}^{N}\Phi\!\left(\frac{p_{1}-p_{j}+\eta_{1}}{\sigma}\right)\right],$$

where $\Phi$ is the CDF of $\eta_{j}/\sigma$. Each factor inside the product is increasing as $\sigma$ decreases (for fixed $p_{1}-p_{j}>0$), establishing the monotonicity ([18](https://arxiv.org/html/2605.22949#S4.E18)).
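The monotonicity can be illustrated with a small Monte Carlo experiment. The sketch below assumes Gaussian noise for $\eta_{i}$, which is one admissible choice rather than something the proof requires; the function name, accuracy values, trial count, and seed are hypothetical.

```python
import random


def selection_probability(accuracies, sigma, trials=20_000, seed=0):
    """Estimate P[the most accurate agent has the highest noisy confidence].

    Assumes independent Gaussian noise with standard deviation `sigma`
    (an illustrative assumption; the proof only needs zero-mean noise
    with common variance).
    """
    rng = random.Random(seed)
    best = max(range(len(accuracies)), key=lambda i: accuracies[i])
    hits = 0
    for _ in range(trials):
        # Calibrated confidence = true accuracy + residual calibration noise.
        noisy = [p + rng.gauss(0.0, sigma) for p in accuracies]
        chosen = max(range(len(noisy)), key=lambda i: noisy[i])
        hits += chosen == best
    return hits / trials


if __name__ == "__main__":
    accuracies = [0.8, 0.7, 0.6]  # hypothetical true accuracies
    for sigma in (0.01, 0.1, 1.0, 100.0):
        print(sigma, selection_probability(accuracies, sigma))
```

Reusing the same seed for every value of `sigma` applies common random numbers, so the comparison across noise levels is not blurred by sampling variation. The estimates should fall from near 1 at small `sigma` towards $1/N$ at large `sigma`.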

For the anti-correlation case: when raw confidence satisfies $\mathrm{Corr}(c_{i},p_{i})<0$ across agents (as observed on hard benchmarks where weaker models are more confident), confidence-weighted selection systematically favours the wrong agent. The pairwise resolution probability, $\mathbb{P}[\tilde{c}_{1}>\tilde{c}_{2}\mid p_{1}>p_{2}]$, falls below 0.5. Any calibration that corrects the sign of the correlation (reducing ECE below the level at which the correlation flips) must raise pairwise resolution above 0.5. ∎
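A toy computation makes the anti-correlation argument concrete. The sketch below measures pairwise resolution as the fraction of agent pairs in which the more accurate agent also reports the higher confidence; this per-agent version is a simplification of the probabilistic definition, ties are counted as half (an assumption), and all numbers are hypothetical rather than taken from the paper's experiments.

```python
from itertools import combinations


def pairwise_resolution(confidences, accuracies):
    """Fraction of agent pairs where the more accurate agent is more confident."""
    wins, total = 0.0, 0
    for i, j in combinations(range(len(accuracies)), 2):
        if accuracies[i] == accuracies[j]:
            continue  # no "better" agent in this pair
        hi, lo = (i, j) if accuracies[i] > accuracies[j] else (j, i)
        total += 1
        if confidences[hi] > confidences[lo]:
            wins += 1.0
        elif confidences[hi] == confidences[lo]:
            wins += 0.5  # tie: count as a coin flip (assumption)
    return wins / total if total else float("nan")


if __name__ == "__main__":
    accuracies = [0.9, 0.7, 0.5]     # hypothetical true accuracies
    raw = [0.60, 0.80, 0.95]         # weaker agents report higher confidence
    calibrated = [0.85, 0.72, 0.55]  # ordering restored after calibration
    print(pairwise_resolution(raw, accuracies))         # 0.0
    print(pairwise_resolution(calibrated, accuracies))  # 1.0
```

In this extreme example the raw confidences are perfectly anti-correlated with accuracy, so resolution is 0; any correction that restores the ordering lifts it above 0.5.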
