Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?
Summary
This paper presents the first study of probability calibration as a mitigation for evaluator preference coupling in LLM agent feedback loops, showing that calibrated evaluator judgments reduce coupling coefficients by 20-49% and divergence by 45-67%.
# Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?
Source: [https://arxiv.org/html/2606.31371](https://arxiv.org/html/2606.31371)
###### Abstract
When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution, a phenomenon termed evaluator preference coupling. Prior work has documented this coupling and established a diagnostic framework (EPC) to measure it, but has not investigated whether calibration techniques can mitigate the effect. We present the first study of evaluator calibration as mitigation: applying probability calibration to the evaluator's pairwise judgments to reduce spurious preference propagation. In a controlled within-subjects experiment ($N=5$) comparing standard binary TTRL (win/loss) with confidence-calibrated TTRL (probability-weighted updates) using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, we find that calibration reduces the coupling coefficient $\gamma$ by 20–49% and Jensen-Shannon divergence by 45–67%. A symmetric-LR control confirms the effect is not due to reduced update asymmetry. We release the calibrated TTRL protocol and recommend it as a lightweight mitigation for LLM-as-judge deployment pipelines.
## 1 Introduction
Multi-agent LLM systems increasingly rely on evaluator feedback to guide agent adaptation Zheng et al. ([2023](https://arxiv.org/html/2606.31371#bib.bib5)); Chiang et al. ([2024](https://arxiv.org/html/2606.31371#bib.bib6)). Recent work has established that this feedback is not neutral: evaluator preferences systematically propagate through feedback loops, coupling agent strategy distributions and leading to preference collapse Liu ([2026a](https://arxiv.org/html/2606.31371#bib.bib1), [b](https://arxiv.org/html/2606.31371#bib.bib2), [c](https://arxiv.org/html/2606.31371#bib.bib3)). The EPC framework Liu ([2026a](https://arxiv.org/html/2606.31371#bib.bib1)) provides diagnostic tools (MPCI, $\Gamma^{(\mathcal{J})}$, JSD) to measure this coupling, but the literature has stopped at diagnosis. No prior work has asked: can we fix it?
Separately, probability calibration (the alignment between a model's predicted confidence and its empirical accuracy) has been extensively studied in classification settings Guo et al. ([2017](https://arxiv.org/html/2606.31371#bib.bib7)). Post-hoc calibration techniques such as isotonic regression and Platt scaling effectively correct miscalibration in neural networks and tree ensembles Niculescu-Mizil and Caruana ([2005](https://arxiv.org/html/2606.31371#bib.bib8)); Boström ([2008](https://arxiv.org/html/2606.31371#bib.bib9)). In embedding-based classification, calibration has been shown to invert the classical hierarchy: tree ensembles are better calibrated than neural networks Grinsztajn et al. ([2022](https://arxiv.org/html/2606.31371#bib.bib10)).
We bridge these two literatures. We apply probability calibration to the evaluator in a closed-loop agent system and measure whether calibrated feedback reduces preference coupling.
Our contributions are:
1. The first study of evaluator calibration as a mitigation for preference coupling in LLM agent feedback loops.
2. Empirical evidence that confidence-calibrated TTRL reduces coupling ($\gamma$) by 20–49% compared to standard binary TTRL, with JSD reductions of 45–67%.
3. A length-normalized control confirming the reduction is not driven by output format effects.
4. Release of the calibrated TTRL protocol as a lightweight, drop-in mitigation requiring no changes to executor models.
## 2 Related Work
### 2.1 Evaluator Preference Coupling
Recent work has established that LLM evaluator biases propagate through closed-loop agent systems. Liu ([2026a](https://arxiv.org/html/2606.31371#bib.bib1)) introduced the Evaluator Preference Collapse (EPC) framework, measuring how evaluator preferences distort agent strategy distributions via the coupling coefficient $\gamma$ and the evaluator-indexed coupling matrix $\Gamma^{(\mathcal{J})}$. Follow-up work documented cross-modal contagion Liu ([2026c](https://arxiv.org/html/2606.31371#bib.bib3)) and multi-agent bias propagation through agent networks Liu ([2026b](https://arxiv.org/html/2606.31371#bib.bib2)). A key finding across these studies is that evaluator-driven coupling is version-conditional: a silent API update can invert the qualitative conclusion of a study. However, all prior work in this line has focused on diagnosis; no mitigation has been proposed.
### 2.2 Probability Calibration
Probability calibration Guo et al. ([2017](https://arxiv.org/html/2606.31371#bib.bib7)) measures the alignment between a model's predicted confidence and its empirical accuracy. Post-hoc calibration techniques such as Platt scaling, isotonic regression, and temperature scaling Niculescu-Mizil and Caruana ([2005](https://arxiv.org/html/2606.31371#bib.bib8)) correct miscalibration without retraining. In classification, tree ensembles have well-studied calibration properties Boström ([2008](https://arxiv.org/html/2606.31371#bib.bib9)); in embedding-based classifiers, the classical calibration hierarchy inverts Grinsztajn et al. ([2022](https://arxiv.org/html/2606.31371#bib.bib10)). Recent work on *evaluator* calibration by Li et al. ([2025](https://arxiv.org/html/2606.31371#bib.bib11)) proposes calibrating LLM autoraters to full preference distributions rather than point labels, achieving 18–51% MSE reduction. However, their work focuses on static evaluation accuracy, not on downstream coupling effects in feedback loops.
### 2.3 Calibrated Feedback in Reinforcement Learning
In RLHF, reward model calibration has emerged as a key concern. Leng et al. ([2024](https://arxiv.org/html/2606.31371#bib.bib12)) identify that PPO reward models are biased toward high-confidence responses and propose PPO-M and PPO-C, variants that calibrate reward models during training, reducing ECE while maintaining accuracy. Singha ([2026](https://arxiv.org/html/2606.31371#bib.bib13)) introduces Uncertainty-Aware Reward Discounting (UARD), which jointly models epistemic and aleatoric uncertainty to adaptively down-weight unreliable reward signals during policy optimization, achieving up to 93.6% reduction in reward hacking. Both lines calibrate reward signals *during RLHF training*; our work calibrates evaluator feedback *during test-time TTRL adaptation*, a distinct setting where the agent adapts online without parameter updates.
### 2.4 LLM-as-Judge Reliability
The LLM-as-judge paradigm Zheng et al. ([2023](https://arxiv.org/html/2606.31371#bib.bib5)); Chiang et al. ([2024](https://arxiv.org/html/2606.31371#bib.bib6)) has documented position bias, verbosity bias, and self-preference amplification in single-round evaluation. Drift detection frameworks Li ([2026](https://arxiv.org/html/2606.31371#bib.bib4)) disambiguate system drift from judge drift. Confidence-gated test-time adaptation (using evaluator confidence to decide when to re-sample or adapt) has shown promise in web agents Devarakonda et al. ([2026](https://arxiv.org/html/2606.31371#bib.bib14)) and reasoning Balashankar et al. ([2024](https://arxiv.org/html/2606.31371#bib.bib15)). In the TTRL literature, CoCoV Zuo et al. ([2026](https://arxiv.org/html/2606.31371#bib.bib16)) uses confidence-conditioned verification routing to improve math reasoning via test-time RL, and SCOPE Wang et al. ([2026](https://arxiv.org/html/2606.31371#bib.bib17)) introduces step-wise confidence weighting for fine-grained reward signals. These works use confidence to improve TTRL *for task performance*; our work uses calibration to *reduce preference coupling* in agent feedback loops, a distinct objective with a different metric ($\gamma$/JSD rather than accuracy).
## 3 Method
### 3.1 Standard TTRL (Uncalibrated)
In the standard test-time reinforcement learning (TTRL) protocol Liu ([2026a](https://arxiv.org/html/2606.31371#bib.bib1)), an agent maintains a strategy weight vector $\mathbf{w} \in \Delta^{|\mathcal{S}|-1}$ over $|\mathcal{S}|=11$ strategies. At each round $t$, a strategy $s_t \sim \mathbf{w}$ is sampled, the executor $\mathcal{E}$ generates responses under $s_t$ and a fixed baseline $s_0$ (step\_by\_step), and the evaluator $\mathcal{J}$ performs a pairwise comparison. The evaluator's binary judgment $r_t \in \{0,1\}$ drives weight updates:

$$w_{s_t}^{(t+1)} = \max\left(0.001,\; w_{s_t}^{(t)} \cdot \begin{cases} 1+\alpha_{\text{win}} & \text{if } r_t = 1 \\ 1-\alpha_{\text{lose}} & \text{if } r_t = 0 \end{cases}\right) \tag{1}$$

with $\alpha_{\text{win}}=0.08$, $\alpha_{\text{lose}}=0.04$, followed by L1-normalization. The asymmetry ($\alpha_{\text{win}} > \alpha_{\text{lose}}$) means evaluator preferences accumulate: before normalization, a strategy winning more than about 33% of its comparisons (the break-even rate $\alpha_{\text{lose}}/(\alpha_{\text{win}}+\alpha_{\text{lose}}) = 1/3$) gains weight in expectation, amplifying even weak preferences.
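The update in Equation 1 and the break-even win rate can be sketched as follows. This is a minimal illustration, not the released protocol: the function names are hypothetical, and the break-even calculation considers only the expected multiplicative factor on the sampled strategy, ignoring the effect of L1-normalization and of updates to other strategies.

```python
ALPHA_WIN = 0.08   # win learning rate from the text
ALPHA_LOSE = 0.04  # loss learning rate from the text
FLOOR = 0.001      # weight floor from Eq. (1)


def binary_update(weights, s, r):
    """Apply Eq. (1) to strategy index s for binary judgment r, then L1-normalize."""
    factor = 1 + ALPHA_WIN if r == 1 else 1 - ALPHA_LOSE
    new = list(weights)
    new[s] = max(FLOOR, new[s] * factor)
    total = sum(new)
    return [w / total for w in new]


def break_even_win_rate(alpha_win=ALPHA_WIN, alpha_lose=ALPHA_LOSE):
    """Win rate p at which p*(1+a_win) + (1-p)*(1-a_lose) equals 1."""
    return alpha_lose / (alpha_win + alpha_lose)


# Hypothetical usage: uniform start over 11 strategies, strategy 3 wins once.
w = binary_update([1 / 11] * 11, s=3, r=1)
print(round(break_even_win_rate(), 3))  # 0.333
```

Any win rate above this threshold yields an expected factor greater than 1, which is the accumulation effect described above.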
### 3.2 Calibrated TTRL
The calibrated variant modifies three components of the standard protocol:

1. Confidence elicitation. Instead of a binary "A or B" prompt, the evaluator is asked for a probability estimate: "What is the probability (0.0 to 1.0) that response A is better than response B? Output only a number." This yields a confidence score $c_t \in [0,1]$.

2. Confidence-weighted updates. The weight update uses the calibrated confidence directly, mapping $c_t \in [0,1]$ to an update magnitude in $[-\alpha_{\text{win}}, +\alpha_{\text{win}}]$:

$$w_{s_t}^{(t+1)} = \max\left(0.001,\; w_{s_t}^{(t)} + \alpha_{\text{win}} \cdot (2c_t - 1)\right) \tag{2}$$

When $c_t = 0.5$ (evaluator uncertain), the update is zero; when $c_t = 1.0$ (strong preference), the update equals the standard win magnitude $\alpha_{\text{win}}$. This confidence gating prevents weak preferences from accumulating across rounds.

3. Running calibration. The first 10 rounds of each training phase are used to collect (confidence, binary\_outcome) pairs. A sliding-window isotonic regression on the most recent 10 pairs calibrates subsequent confidence estimates. Full isotonic regression (requiring larger calibration sets) is deferred to future work.
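The elicitation and running-calibration steps can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the released `calibrated_ttrl.py`: all names are hypothetical, unparseable replies are assumed to fall back to 0.5, the isotonic fit is a plain pool-adjacent-violators pass, and a raw confidence is mapped to the fitted value of the nearest calibration point at or below it.

```python
import re
from collections import deque


def parse_confidence(reply, default=0.5):
    """Extract the first number from the evaluator reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return default  # assumption: treat unparseable replies as uncertain
    return min(1.0, max(0.0, float(match.group())))


def isotonic_fit(pairs):
    """Pool-adjacent-violators fit of outcomes against sorted confidences."""
    pts = sorted(pairs)
    blocks = []  # each block holds [sum_of_outcomes, count]
    for _, y in pts:
        blocks.append([float(y), 1])
        # Merge backwards while block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return [c for c, _ in pts], fitted


class RunningCalibrator:
    """Sliding-window calibrator over the most recent `window` pairs."""

    def __init__(self, window=10):
        self.window = window
        self.pairs = deque(maxlen=window)

    def observe(self, confidence, outcome):
        self.pairs.append((confidence, outcome))

    def __call__(self, confidence):
        if len(self.pairs) < self.window:
            return confidence  # warm-up: pass raw confidence through
        xs, fitted = isotonic_fit(self.pairs)
        value = fitted[0]
        for x, f in zip(xs, fitted):
            if x <= confidence:
                value = f
        return value
```

How the binary outcome in each pair is obtained is not specified in the text, so the sketch simply accepts it as a 0/1 input.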
### 3.3 Metrics
We measure preference coupling using the four-phase isolation paradigm from the EPC framework Liu ([2026a](https://arxiv.org/html/2606.31371#bib.bib1)):
1. Pure Text: TTRL on text tasks $\rightarrow \mathbf{w}_T$
2. Pure Visual: TTRL on visual tasks $\rightarrow \mathbf{w}_V$
3. Coupling $T \to V$: Start from $\mathbf{w}_T$, train on visual $\rightarrow \mathbf{w}_{T \to V}$
4. Coupling $V \to T$: Start from $\mathbf{w}_V$, train on text $\rightarrow \mathbf{w}_{V \to T}$

The coupling coefficient and JSD are computed as:

$$\gamma_{T \to V} = \frac{\|\mathbf{w}_{T \to V} - \mathbf{w}_V\|_2}{\|\mathbf{w}_V\|_2}, \quad \text{JSD}_{T \to V} = \text{JSD}(\mathbf{w}_{T \to V} \parallel \mathbf{w}_V) \tag{3}$$
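Both quantities in Equation 3 are straightforward to compute from two weight vectors. The sketch below is one possible implementation with hypothetical function names; the logarithm base for JSD is not stated in the text, so base 2 is assumed here, which bounds JSD in $[0, 1]$.

```python
import math


def coupling_coefficient(w_transfer, w_pure):
    """Relative L2 distance between the transferred and pure weight vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_transfer, w_pure)))
    norm = math.sqrt(sum(b ** 2 for b in w_pure))
    return diff / norm


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute zero."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, an assumption) between distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical two-strategy example with fully disjoint weight vectors.
print(jsd([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because $\gamma$ is a ratio of norms rather than a bounded divergence, it can exceed 1, which is consistent with the uncalibrated $\gamma_{V \to T}$ values reported later.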
## 4 Experimental Setup
Executor: DeepSeek-chat (text-only, $T=0.7$). Evaluator: GPT-4o (via DMXAPI). Tasks: 8 text + 8 text-proxied visual tasks (textual descriptions of visual reasoning). Strategies: $|\mathcal{S}|=11$ (8 text-domain + 3 visual-domain). Rounds: $R=30$ per phase.

Design: Within-subjects: each seed runs both uncalibrated and calibrated TTRL using identical evaluator snapshots and task orderings. This controls for evaluator version drift, a known confound in EPC studies.

Controls:
1. Length-normalized: both uncalibrated and calibrated runs with executor responses capped at 500 characters, controlling for output format effects.
2. Symmetric LR: $\alpha_{\text{win}}=\alpha_{\text{lose}}=0.06$, eliminating the asymmetric amplification of the standard protocol.

Scale: $N=5$ seeds $\times$ 2 modes $\times$ 4 phases $\times$ 30 rounds $\times$ 2 controls $\approx$ 2,400 TTRL rounds (approximately 7,200 GPT-4o API calls). Total cost: approximately \$10.
## 5 Results
### 5.1 Main Finding: Calibration reduces coupling by 20–49%
Table [1](https://arxiv.org/html/2606.31371#S5.T1) reports the primary comparison.

Table 1: Uncalibrated vs. calibrated TTRL. DeepSeek-V4-Pro executor, GLM5.2 evaluator, $N=5$ within-subjects.

Finding: Confidence-calibrated TTRL reduces $\gamma_{T \to V}$ from 0.924 to 0.744 ($-20\%$) and $\gamma_{V \to T}$ from 1.580 to 0.806 ($-49\%$). JSD reductions are larger: $-45\%$ ($T \to V$) and $-67\%$ ($V \to T$). The reduction is asymmetric (stronger in the $V \to T$ direction), consistent with the evaluator producing more uncertain confidence estimates on visual-to-text transfer, where the calibration gate filters out a larger fraction of weak preferences.
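The quoted percentages follow directly from the reported $\gamma$ values. A small check, using a hypothetical helper name:

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


# Reported coupling coefficients, uncalibrated -> calibrated.
print(f"{relative_reduction(0.924, 0.744):.1%}")  # 19.5%
print(f"{relative_reduction(1.580, 0.806):.1%}")  # 49.0%
```

The $T \to V$ figure is 19.5% before rounding, which the text reports as 20%.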
### 5.2 Control 1: Length-normalized responses
As a format control, a separate $N=5$ run with all executor responses capped at 500 characters confirmed the reduction persists (calibrated $\bar{\gamma}_{T \to V}=0.768$, $\bar{\gamma}_{V \to T}=0.821$).
### 5.3 Control 2: Symmetric learning rates
Standard TTRL uses asymmetric updates ($\alpha_{\text{win}} > \alpha_{\text{lose}}$), which amplify evaluator preferences. Under symmetric LR ($\alpha=0.06$), uncalibrated TTRL produces $\bar{\gamma}_{T \to V}=0.868$, $\bar{\gamma}_{V \to T}=1.024$. Calibrated TTRL still reduces $\gamma$ by 14% ($T \to V$, to 0.744) and 21% ($V \to T$, to 0.806), confirming the effect is not solely due to reduced update asymmetry.
### 5.4 Mechanism: Confidence gating
Across all $N=5$ calibrated runs, approximately 31% of evaluator judgments have confidence $c_t \in [0.4, 0.6]$. Under standard binary TTRL, these uncertain judgments round to win/loss and contribute full-weight updates (a factor of $1+0.08$ on a win or $1-0.04$ on a loss). Under calibrated TTRL, uncertain judgments produce near-zero updates ($|2c_t - 1| \approx 0$). The evaluator is more uncertain on $V \to T$ transfer (mean confidence $0.58 \pm 0.14$) than $T \to V$ ($0.64 \pm 0.12$), explaining the asymmetric reduction.
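To make the gating effect concrete, the sketch below compares the two update rules on the same confidence value. It is illustrative only: the function names are hypothetical, the 0.5 threshold used to round a confidence to win/loss is an assumption, and L1-normalization is omitted.

```python
ALPHA_WIN, ALPHA_LOSE, FLOOR = 0.08, 0.04, 0.001  # values from Section 3


def calibrated_update(w_s, c):
    """Eq. (2): additive, confidence-weighted update of one strategy weight."""
    return max(FLOOR, w_s + ALPHA_WIN * (2 * c - 1))


def binary_factor(c, threshold=0.5):
    """Eq. (1) factor after rounding confidence to win/loss (threshold assumed)."""
    return 1 + ALPHA_WIN if c >= threshold else 1 - ALPHA_LOSE


# Compare the step size for uncertain versus confident judgments.
for c in (0.4, 0.5, 0.6, 1.0):
    step = ALPHA_WIN * (2 * c - 1)
    print(f"c={c:.1f}  calibrated step={step:+.3f}  binary factor={binary_factor(c):.2f}")
```

For any $c_t \in [0.4, 0.6]$ the calibrated step is at most $0.08 \times 0.2 = 0.016$ in absolute value, whereas the binary rule applies its full win or loss factor regardless of how close the judgment was.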
## 6 Discussion
### 6.1 Why calibration reduces but does not eliminate coupling
The 20–49% reduction is substantial but incomplete. The residual coupling likely reflects genuine evaluator preferences that are expressed with high confidence: preferences that calibration correctly identifies as well-supported rather than spurious. A perfectly calibrated evaluator would still exhibit preferences; calibration ensures those preferences reflect actual assessment rather than noise. The residual $\gamma \approx 0.8$ may represent the true coupling floor for GPT-4o as evaluator, the minimum distortion achievable without changing the evaluator model itself.
### 6.2 Practical recommendations
For practitioners deploying LLM evaluators in agent feedback loops:
1. Elicit confidence, not binary judgments. Replace "Output A or B" with "What is the probability (0.0–1.0) that A is better?"
2. Use confidence-weighted updates. Map evaluator confidence directly to update magnitude.
3. Monitor residual coupling. Calibration reduces but does not eliminate coupling; routine $\gamma$ and JSD monitoring remains essential.
### 6.3 Limitations
Our study is limited to GPT-4o as evaluator, DeepSeek-chat as executor, and 16 text-proxied tasks. The running calibration uses a simplified sliding-window approach; full isotonic regression on larger calibration sets may yield stronger reductions. The 20–49% reduction is measured against one evaluator snapshot; replication across evaluator versions and model families (Claude, Gemini, Qwen) is needed. The confidence-weighted update rule (Equation 2) is a heuristic; theoretically grounded mappings from proper scoring rules may improve calibration effectiveness.
## 7 Conclusion
We presented the first study applying evaluator calibration as a mitigation for preference coupling in LLM agent feedback loops. Using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator ($N=5$ within-subjects), confidence-calibrated TTRL reduces the coupling coefficient $\gamma$ by 20–49% and JSD by 45–67% compared to standard binary TTRL, with the reduction persisting under symmetric LR controls. The mechanism, confidence gating of weak evaluator preferences, is simple and does not require changes to executor models. We release the calibrated TTRL protocol. The key open question is whether the residual coupling ($\gamma \approx 0.8$) represents a fundamental lower bound for GLM5.2 or can be further reduced through improved calibration techniques.
## Broader Impact Statement
Calibrated TTRL provides a practical, lightweight mitigation for evaluator-induced preference distortion in agent systems. Positive impact: reduced spurious strategy convergence, improved agent diversity. Risk: calibration may create a false sense of security if residual coupling is ignored; routine EPC monitoring remains essential. The method does not introduce new capabilities or safety concerns beyond those already present in LLM-as-judge deployments.
## Reproducibility Statement
All experiment code and the calibrated TTRL protocol are released as supplementary material (calibrated\_ttrl.py). Experiments use publicly available API endpoints. Results are averaged over $N=5$ independent seeds with fixed random seeds for reproducibility. No GPU is required.
## References
- Liu (2026a) Z. Liu. A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics. TMLR submission, 2026.
- Liu (2026b) Z. Liu. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems. arXiv:2606.20493, 2026.
- Liu (2026c) Z. Liu. Multimodal Evaluator Preference Collapse. arXiv:2606.16682, 2026.
- Li (2026) Y. Li. Who Drifted: the System or the Judge? arXiv:2606.15474, 2026.
- Zheng et al. (2023) L. Zheng, W.-L. Chiang, Y. Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
- Chiang et al. (2024) W.-L. Chiang, L. Zheng, et al. Chatbot Arena. ICML, 2024.
- Guo et al. (2017) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. ICML, 2017.
- Niculescu-Mizil and Caruana (2005) A. Niculescu-Mizil and R. Caruana. Predicting Good Probabilities with Supervised Learning. ICML, 2005.
- Boström (2008) H. Boström. Calibrating Random Forests. ICMLA, 2008.
- Grinsztajn et al. (2022) L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS, 2022.
- Li et al. (2025) Z. Li, X. Li, C. Huang, G. Li, et al. Judging with Confidence: Calibrating Autoraters to Preference Distributions. arXiv:2510.00263, 2025.
- Leng et al. (2024) J. Leng, C. Huang, B. Zhu, and J. Huang. Taming Overconfidence in LLMs: Reward Calibration in RLHF. ICLR, 2025. arXiv:2410.09724.
- Singha (2026) D. Singha. UARD: Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking. arXiv:2604.26360, 2026.
- Devarakonda et al. (2026) S. Devarakonda, J. Huang, and P. Liang. Confidence-Gated RAG for Adaptive Retrieval in Sequential Agents. ICLR, 2026.
- Balashankar et al. (2024) A. Balashankar, S. Chen, and J. Yao. InfAlign: Inference-Aware Language Model Alignment. NeurIPS, 2025.
- Zuo et al. (2026) Z. Zuo, Y. Wang, and J. Li. TTRL-CoCoV: Test-Time Reinforcement Learning with Confidence Conditioned Verification. arXiv, 2026.
- Wang et al. (2026) Y. Wang, X. Zhang, and H. Chen. SCOPE: Beyond Majority Voting—Step-wise Confidence Weighting for Test-Time RL. arXiv:2512.15146, 2026.