EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
Summary
This paper introduces EPC, a standardized protocol for measuring evaluator preference coupling in LLM agent systems, including a reference snapshot and versioning convention to address reproducibility and measurement decay.
# EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
Source: [https://arxiv.org/html/2607.00297](https://arxiv.org/html/2607.00297)
###### Abstract
When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent’s strategy distribution, a phenomenon known as evaluator preference coupling. Prior work has documented coupling across multiple evaluator families and model versions, but the field lacks a standardized protocol that enables third-party researchers to (i) reproduce coupling measurements, (ii) compare results across evaluators and time points, and (iii) detect measurement decay as proprietary evaluators silently update. This paper provides the protocol. We specify EPC (Evaluator Preference Coupling), a detailed, RFC-style protocol specification for the four-phase isolation paradigm, covering executor and evaluator configuration, strategy and task design, the TTRL update rule, metric computation ($\gamma$, JSD, ECE, Brier), and output schema. We accompany the protocol with a versioned Reference Snapshot v1.0: coupling measurements for eight evaluator conditions ($N = 122$ unique experimental repetitions across GPT-4o, Qwen, DeepSeek, and others) derived from five independent studies, annotated with evaluator version identifiers, API endpoints, and measurement dates. The snapshot is explicitly time-bound: all values are conditional on specific model versions and are expected to decay as proprietary evaluators update. We define a versioning convention (vX.Y-Z, encoding protocol version, snapshot version, and evaluator generation) and provide a usage guide covering adoption, interpretation, and known pitfalls. The protocol, reference snapshot, and implementation code are released as open infrastructure.
## 1 Introduction
Evaluator-driven preference dynamics have been documented across multiple LLM agent configurations Liu ([2026a](https://arxiv.org/html/2607.00297#bib.bib1);[b](https://arxiv.org/html/2607.00297#bib.bib2);[c](https://arxiv.org/html/2607.00297#bib.bib3)). In the standard setup, an agent maintains a strategy weight distribution, receives pairwise feedback from an evaluator, and adapts via test-time reinforcement learning (TTRL). The coupling coefficient $\gamma$ and Jensen-Shannon divergence (JSD) quantify how strongly the evaluator’s preferences transfer across task domains and how concentrated the agent’s strategy distribution becomes.
However, the field currently operates without a standardized protocol. Each study uses slightly different protocol variants, task sets, strategy definitions, and metric implementations, so results cannot be compared directly across studies. More critically, proprietary evaluators silently update, causing measurements to decay within weeks Liu ([2026a](https://arxiv.org/html/2607.00297#bib.bib1)). Without versioned baselines and explicit expiration dates, the literature accumulates measurements that are no longer valid for current model versions. This problem is not unique to evaluator coupling: a recent audit of 26 AI benchmarks found that the median benchmark has a longevity score of just 5 out of 100 BenchRisk ([2026](https://arxiv.org/html/2607.00297#bib.bib11)), and the ML community is shifting toward continuous, versioned, community-governed evaluation infrastructure MLCommons ([2026](https://arxiv.org/html/2607.00297#bib.bib13)); SWE-rebench ([2025](https://arxiv.org/html/2607.00297#bib.bib12)); HF Community ([2026](https://arxiv.org/html/2607.00297#bib.bib16)).
This paper provides the protocol, the reference snapshot, and the versioning convention\.
This paper is not a claim of new empirical findings. It is a protocol specification paper, analogous to an RFC in the networking community or a measurement standard in the physical sciences. The coupling measurements in the reference snapshot have been previously reported in domain-specific studies Liu ([2026a](https://arxiv.org/html/2607.00297#bib.bib1)). Our contribution is the standardization, versioning, and community infrastructure that transforms these measurements from one-off observations into a reproducible, comparable, and auditable measurement system. We explicitly do not introduce new metrics, new experimental conditions, or new scientific claims. We introduce a *discipline*: a protocol that enables the community to collectively maintain currency as evaluator models evolve.
## 2 Protocol Specification
This section provides the complete EPC protocol specification, organized as a reference document that can be independently implemented\.
### 2.1 Overview
The protocol measures evaluator preference coupling through a four-phase isolation paradigm. In Phase 1 (Pure Text), the agent undergoes TTRL on text-only tasks. In Phase 2 (Pure Visual), the agent undergoes TTRL on visual-adjacent tasks. In Phase 3 ($T \to V$ coupling), the agent starts from Phase 1 weights and trains on visual tasks. In Phase 4 ($V \to T$ coupling), the agent starts from Phase 2 weights and trains on text tasks. The coupling coefficient $\gamma_{A \to B}$ quantifies how much the weight distribution shifts from the pure-domain reference.
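The phase ordering can be sketched as follows. This is a minimal illustration, not the reference implementation: `train` is a hypothetical callback that runs one TTRL phase and returns the final weight vector, and starting both pure phases from the same `init_weights` is an assumption, since the text does not state the initial distribution.

```python
def run_four_phases(train, init_weights):
    """Run the four-phase isolation paradigm.

    `train(start_weights, domain)` is a hypothetical callback that runs one
    TTRL phase on the given task domain and returns the final weights.
    """
    w_text = train(init_weights, "text")      # Phase 1: Pure Text
    w_visual = train(init_weights, "visual")  # Phase 2: Pure Visual
    w_t_to_v = train(w_text, "visual")        # Phase 3: start from Phase 1 weights
    w_v_to_t = train(w_visual, "text")        # Phase 4: start from Phase 2 weights
    return {"T": w_text, "V": w_visual, "T->V": w_t_to_v, "V->T": w_v_to_t}
```

The returned dictionary pairs each post-coupling vector with its pure-domain reference: `"T->V"` is compared against `"V"`, and `"V->T"` against `"T"`.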
### 2.2 Agent Configuration
Executor: Any LLM accessible via API. The protocol is executor-agnostic; the reference snapshot uses DeepSeek-chat.
Evaluator: Any LLM accessible via API. Can be the same model as the executor (self-evaluation) or a different model (cross-model evaluation). The evaluator identity must be recorded with version identifier (e.g., `gpt-4o-2024-08-06`), API endpoint, and measurement date.
Strategies: 11 strategies (8 text-domain + 3 visual-domain), defined in Appendix [A.1](https://arxiv.org/html/2607.00297#A1.SS1). Each strategy is a natural-language prompt prefix.
Protocol requirement: strategies must be documented verbatim in the output manifest. Researchers may substitute domain-specific strategies but must report the full strategy text.
### 2.3 TTRL Algorithm
The agent maintains an L1-normalized weight vector $\mathbf{w} \in \Delta^{K-1}$ over $K = 11$ strategies. At each round $t$:
1. Sample strategy $s_t \sim \mathbf{w}$ via roulette-wheel selection.
2. The executor generates a response under $s_t$ and under the fixed baseline $s_0$ (`step_by_step`).
3. The evaluator performs a pairwise comparison and prefers either $s_t$ (win, $r_t = 1$) or $s_0$ (loss, $r_t = 0$).
4. Weight update: $w_{s_t} \leftarrow \max(0.001,\, w_{s_t} + \alpha)$, where $\alpha = 0.08$ on a win and $\alpha = -0.04$ on a loss. Renormalize so the weights sum to 1.
Protocol requirements: (a) The baseline strategy $s_0$ must be `step_by_step`. (b) The learning rates $\alpha_{\text{win}} = 0.08$ and $\alpha_{\text{lose}} = 0.04$ are fixed. Report any deviation. (c) The weight floor is $0.001$. (d) $R = 30$ rounds per phase. Report round count if varied. (e) Random seed must be recorded.
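The update rule above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the reference implementation: the function names and the `judge` callback are hypothetical, and a `None` verdict stands for an unparseable evaluator output that is counted as a tie and skipped.

```python
import random

ALPHA_WIN = 0.08     # increment applied on a win
ALPHA_LOSE = 0.04    # magnitude of the decrement applied on a loss
WEIGHT_FLOOR = 0.001


def normalize(weights):
    """L1-normalize a list of non-negative weights."""
    total = sum(weights)
    return [w / total for w in weights]


def ttrl_step(weights, strategy_idx, reward):
    """Apply the floored update to the sampled strategy, then renormalize."""
    delta = ALPHA_WIN if reward == 1 else -ALPHA_LOSE
    updated = list(weights)
    updated[strategy_idx] = max(WEIGHT_FLOOR, updated[strategy_idx] + delta)
    return normalize(updated)


def run_phase(weights, judge, rounds=30, seed=0):
    """Run one phase of `rounds` TTRL rounds.

    `judge(strategy_idx, rng)` is a hypothetical callback returning 1 (win),
    0 (loss), or None (tie: no weight update). Returns (weights, tie_count).
    """
    rng = random.Random(seed)  # record the seed in the manifest
    weights = normalize(weights)
    ties = 0
    for _ in range(rounds):
        # Roulette-wheel selection: sample an index proportionally to weight.
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        reward = judge(idx, rng)
        if reward is None:
            ties += 1
            continue
        weights = ttrl_step(weights, idx, reward)
    return weights, ties
```

For example, `run_phase([1 / 11] * 11, lambda idx, rng: 1 if idx == 3 else 0)` simulates a scripted evaluator that always prefers strategy 3 over the baseline.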
### 2.4 Task Design
Protocol requirements: (a) Minimum 8 text-domain tasks and 8 visual-adjacent tasks. (b) Tasks must be documented verbatim in the output manifest. (c) The reference snapshot uses the task set in Appendix [A.2](https://arxiv.org/html/2607.00297#A1.SS2). Researchers may substitute domain-specific tasks but must report them in full.
### 2.5 Metric Computation
$\gamma$ (Coupling Coefficient):
$$\gamma_{A \to B} = \frac{\lVert \mathbf{w}_{A \to B} - \mathbf{w}_{B} \rVert_2}{\lVert \mathbf{w}_{B} \rVert_2} \tag{1}$$
where $\mathbf{w}_{B}$ is the pure-domain weight vector and $\mathbf{w}_{A \to B}$ is the post-coupling vector.
JSD (Jensen-Shannon Divergence): Computed in base $e$ between $\mathbf{w}_{T \to V}$ and $\mathbf{w}_{V}$ (for $T \to V$ coupling) and between $\mathbf{w}_{V \to T}$ and $\mathbf{w}_{T}$ (for $V \to T$ coupling).
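Both metrics reduce to a few lines of arithmetic on the stored weight vectors. The sketch below follows Eq. (1) and the standard definition of JSD in nats; the function names are illustrative and not taken from the reference implementation.

```python
import math


def coupling_gamma(w_coupled, w_pure):
    """Eq. (1): L2 distance to the pure-domain vector, divided by its L2 norm."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_coupled, w_pure)))
    ref = math.sqrt(sum(b ** 2 for b in w_pure))
    return diff / ref


def _kl(p, q):
    """KL divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in base e, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Note that some library routines return the Jensen-Shannon *distance* (the square root of the divergence), so values should be checked against a hand-computed case before being reported.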
ECE (Expected Calibration Error): Optional. Bins strategies by evaluator win rate, measures $|\text{mean(win\_rate)} - \text{mean(accuracy)}|$ per bin. Requires ground-truth task accuracy.
Brier Score: Optional\. Mean squared error between per\-strategy win rate and accuracy\.
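A minimal sketch of the two optional calibration metrics follows. Two details are assumptions not fixed by the text above: the number of equal-width bins (`n_bins=5`) and the aggregation of per-bin gaps as an average weighted by the fraction of strategies in each bin, which mirrors the usual ECE definition (Guo et al., 2017).

```python
def brier_score(win_rates, accuracies):
    """Mean squared error between per-strategy win rate and accuracy."""
    pairs = list(zip(win_rates, accuracies))
    return sum((w - a) ** 2 for w, a in pairs) / len(pairs)


def expected_calibration_error(win_rates, accuracies, n_bins=5):
    """Bin strategies by win rate and average |mean win rate - mean accuracy|.

    Assumptions: equal-width bins on [0, 1]; per-bin gaps are weighted by
    the share of strategies that fall into the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for w, a in zip(win_rates, accuracies):
        idx = min(int(w * n_bins), n_bins - 1)  # a win rate of 1.0 goes to the last bin
        bins[idx].append((w, a))
    n = len(win_rates)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_w = sum(w for w, _ in members) / len(members)
        mean_a = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(mean_w - mean_a)
    return ece
```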
Protocol requirements: (a) $\gamma$ is the minimum required metric. (b) JSD must be reported if weight vectors are stored. (c) ECE and Brier are optional but recommended. (d) Report per-seed values, not just means. (e) Report zero-coupling rate (% of seeds with $\gamma = 0$). (f) Report 95% bootstrap confidence intervals (2,000 resamples at the seed level) for $\gamma$ and JSD means. (g) Recommend $N \geq 10$ seeds for initial screening, $N \geq 30$ for publication-grade estimates targeting a CI width $\leq 0.1$ on $\gamma$. The reference snapshot reports $N$ per condition to enable sample-size-aware interpretation.
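Requirements (e) and (f) can be sketched as below. The text does not say which bootstrap variant is intended; this sketch assumes the percentile method, and the function names and the `tol` parameter are illustrative.

```python
import random


def zero_coupling_rate(gammas, tol=0.0):
    """Fraction of seeds whose gamma is zero (within an optional tolerance)."""
    return sum(1 for g in gammas if abs(g) <= tol) / len(gammas)


def bootstrap_ci_mean(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean, resampling per-seed values.

    Assumption: percentile method. Returns (lower, upper).
    """
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n seeds with replacement and records the mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

The interval width `upper - lower` can then be compared against the 0.1 target in requirement (g) when deciding whether more seeds are needed.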
### 2.6 Evaluator Prompt Specification
The evaluator pairwise comparison prompt must be included verbatim in the output manifest\. The reference implementation uses:
> Evaluate\. Task: \{task\} A \(\{strategy\_name\}\): \{response\_A\[:300\]\} B \(step\_by\_step\): \{response\_B\[:300\]\} Better? Output only A or B\.
Protocol requirements: (a) The evaluator prompt must be documented in the manifest. (b) Decoding settings (temperature, max_tokens, top-p, stop sequences) must be reported. The reference implementation uses temperature = 0.0, max_tokens = 10. (c) If the evaluator output cannot be parsed as “A” or “B”, the trial must be recorded as a tie and excluded from the weight update, with the tie rate reported in the manifest. (d) Chain-of-thought or reasoning prefixes in evaluator output must be disabled (temperature = 0.0, no system prompt encouraging explanation). Researchers who modify the prompt or decoding settings must tag their results as `EPC-v1.0-AltPrompt`.
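Prompt construction and verdict parsing might look like the following sketch. The template string reproduces the reference prompt quoted above, including the 300-character truncation. The parsing tolerance (surrounding whitespace, quotes, and a trailing period are accepted) is an assumption, since the text only requires that unparseable output be recorded as a tie; the function names are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Evaluate. Task: {task} "
    "A ({strategy_name}): {response_a} "
    "B (step_by_step): {response_b} "
    "Better? Output only A or B."
)


def build_prompt(task, strategy_name, response_a, response_b, max_chars=300):
    """Fill the pairwise prompt, truncating each response to `max_chars`."""
    return PROMPT_TEMPLATE.format(
        task=task,
        strategy_name=strategy_name,
        response_a=response_a[:max_chars],
        response_b=response_b[:max_chars],
    )


def parse_verdict(raw_output):
    """Map evaluator output to a reward.

    Returns 1 if A (the sampled strategy) wins, 0 if B (the baseline) wins,
    and None for a tie, i.e. any output that is not exactly A or B after
    trimming whitespace, quotes, and periods (an assumed tolerance).
    """
    token = raw_output.strip().strip(".\"'").upper()
    if token == "A":
        return 1
    if token == "B":
        return 0
    return None
```

The tie rate for the manifest is then the number of `None` verdicts divided by the number of rounds.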
### 2.7 Design Rationale
Why $\gamma$ uses L2 normalization by $\lVert \mathbf{w}_{B} \rVert_2$. The L2 norm preserves the Euclidean geometry of the probability simplex and is directly interpretable as relative distance. While bounded divergences (JSD, total variation, Hellinger) are more robust for cross-condition comparison, our empirical $\gamma$-JSD correlation ($r = 0.969$ across $N = 152$ paired observations) confirms that $\gamma$ tracks JSD faithfully in practice. The protocol mandates $\gamma$ as the minimum metric and strongly recommends JSD for cross-condition reporting. Researchers may substitute alternative distance measures but must report the $\gamma$ value alongside for comparability.
Why $\alpha_{\text{win}} = 0.08$, $\alpha_{\text{lose}} = 0.04$. These values were chosen to balance learning speed against stability: the asymmetry ($\alpha_{\text{win}} > \alpha_{\text{lose}}$) reflects the conservative prior that evaluator preferences are noisy and that false positives (rewarding a strategy the evaluator does not genuinely prefer) should be corrected more aggressively than false negatives. The specific values were calibrated in prior work Liu ([2026a](https://arxiv.org/html/2607.00297#bib.bib1)) to produce measurable concentration ($\text{PCI} \approx 0.5$–$1.5$) within $R = 30$ rounds without early collapse to a single strategy. A systematic hyperparameter sensitivity analysis has not been conducted. The symmetric learning rate variant ($\alpha_{\text{win}} = \alpha_{\text{lose}} = 0.06$) has been tested on GPT-4o and produced zero coupling in all 8 repetitions, but this result coincides with a documented evaluator version drift window and cannot be attributed to the learning rate alone without a version-locked replication. The protocol encourages researchers to explore the $\alpha$ space and report alternative settings as `EPC-v1.0-AltLR`. The floor at $0.001$ prevents weight starvation (strategies becoming unselectable) while having negligible effect on the final weight distribution.
### 2.8 Conformance and Extensibility
Conformance test suite. A reference test suite with mock evaluators (deterministic, coin-flip, and scripted-preference) is provided with the protocol implementation. The test suite verifies that independent implementations produce identical $\gamma$ and JSD values on fixed input sequences, covering edge cases (all weights at floor, uniform initial distribution, single-strategy dominance, exact ties). Passing the conformance suite qualifies an implementation as `EPC-v1.0-compatible`.
Protocol variants. The protocol is designed to be extensible. Researchers who modify core parameters (learning rates, baseline strategy, round count, strategy set) must tag their results with the variant label (e.g., `EPC-v1.0-AltLR`, `EPC-v1.0-AltBaseline`) and report all deviations from the reference specification in the manifest. This enables community exploration of alternative configurations while preserving comparability through a core conformance path.
Every EPC measurement must produce a machine\-readable manifest containing:
1. Protocol version: `EPC-v1.0`
2. Evaluator: model identifier, API endpoint, measurement date (YYYY-MM-DD)
3. Executor: model identifier, API endpoint
4. Configuration: $R$, $\alpha_{\text{win}}$, $\alpha_{\text{lose}}$, random seed, number of strategies
5. Task set: verbatim task list
6. Strategy set: verbatim strategy prompts
7. Results: per-seed $\gamma_{T \to V}$, $\gamma_{V \to T}$, JSD (if available), zero-coupling rate, weight vectors (strongly recommended)
A JSON schema is provided in the protocol implementation \(epc\_manifest\_schema\.json\)\.
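To make the manifest contents concrete, the sketch below builds an illustrative manifest and checks that the top-level fields are present. The top-level keys follow the key fields listed in Appendix A.3; every nested key name, the endpoint URLs, the date, and the `null` result values are placeholders chosen for illustration and are not taken from `epc_manifest_schema.json` or from any measurement.

```python
import json

# Top-level fields named in Appendix A.3.
REQUIRED_KEYS = (
    "protocol_version", "evaluator", "executor",
    "config", "tasks", "strategies", "results",
)

# Illustrative manifest: nested key names and all values are placeholders.
manifest = {
    "protocol_version": "EPC-v1.0",
    "evaluator": {
        "model": "gpt-4o-2024-08-06",
        "endpoint": "https://api.example.invalid/v1",
        "measurement_date": "2026-01-01",
    },
    "executor": {"model": "deepseek-chat", "endpoint": "https://api.example.invalid/v1"},
    "config": {"rounds": 30, "alpha_win": 0.08, "alpha_lose": 0.04,
               "seed": 0, "n_strategies": 11},
    "tasks": {"text": ["Explain why the sky is blue."], "visual": []},
    "strategies": [{"name": "step_by_step", "prompt": "<verbatim prompt prefix>"}],
    "results": {
        "per_seed": [{"seed": 0, "gamma_t_to_v": None, "gamma_v_to_t": None,
                      "jsd_t_to_v": None, "jsd_v_to_t": None, "weights": None}],
        "zero_coupling_rate": None,
    },
}


def missing_keys(candidate):
    """Return the required top-level keys absent from a manifest dict."""
    return [key for key in REQUIRED_KEYS if key not in candidate]


manifest_json = json.dumps(manifest, indent=2)
```

A full check would validate against the released JSON schema; the presence check here only guards against omitting a top-level section.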
## 3 Reference Snapshot v1.0
This section provides the v1.0 reference baseline, derived from measurements spanning May–June 2026 across five independent studies Liu ([2026a](https://arxiv.org/html/2607.00297#bib.bib1)). All values are version-bound and expected to decay.
Table 1: EPC Reference Snapshot v1.0, cross-model evaluation conditions. All measurements May–June 2026. Values expected to expire as evaluator models update.
Table 2: EPC Reference Snapshot v1.0, multi-gateway replications (June 27, 2026).
Table 3: EPC Reference Snapshot v1.0, calibration baselines (self-evaluation, $N = 10$, $R = 16$).
### 3.1 Snapshot Validity Statement
These values were measured between May 27 and June 27, 2026. GPT-4o measurements were obtained via third-party API gateways (api2d, DMXAPI) and have not been replicated via direct OpenAI API. Qwen measurements use different model versions (qwen3.7-plus vs. qwen-plus) within the same provider ecosystem. All values are snapshot measurements conditional on specific, now-deprecated model versions. The GPT-4o May-to-June drift (Table [1](https://arxiv.org/html/2607.00297#S3.T1), rows 1 and 4) demonstrates that coupling measurements can invert within 4 weeks. Users of this snapshot must check whether the evaluator versions listed above are still current.
## 4 Versioning Convention
EPC baselines follow a three-component versioning scheme:
vX.Y-Z
where:
- X: Protocol major version. Incremented on incompatible protocol changes.
- Y: Snapshot version. Incremented on new measurements of the same evaluator.
- Z: Evaluator generation. Encodes the evaluator model generation (e.g., `GPT4o-0806`, `Qwen3.7-0526`).
Example: `v1.2-GPT4o-0806` = EPC protocol v1, second snapshot of GPT-4o (August 2024 checkpoint).
Community\-contributed snapshots follow the same convention\. A submission template is provided with the protocol implementation\.
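A label in this scheme can be split mechanically. The sketch below assumes that X and Y are non-negative integers and that Z is everything after the first hyphen (so generation tags containing hyphens or dots, such as `GPT4o-0806` or `Qwen3.7-0526`, stay intact); the class and function names are illustrative.

```python
import re
from typing import NamedTuple


class SnapshotVersion(NamedTuple):
    protocol_major: int
    snapshot: int
    evaluator_generation: str


# v<X>.<Y>-<Z>: Z is taken as the remainder of the label after the first hyphen.
_VERSION_PATTERN = re.compile(r"^v(\d+)\.(\d+)-(\S+)$")


def parse_snapshot_version(label):
    """Split a label such as 'v1.2-GPT4o-0806' into its three components."""
    match = _VERSION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a vX.Y-Z snapshot label: {label!r}")
    return SnapshotVersion(int(match.group(1)), int(match.group(2)), match.group(3))
```

Two snapshots are directly comparable only when `protocol_major` and `evaluator_generation` both match; a differing `snapshot` component indicates a re-measurement of the same evaluator.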
## 5 Usage Guide
### 5.1 Adoption
1. Clone the protocol implementation repository.
2. Configure API credentials for your executor and evaluator.
3. Run `python epc_protocol.py --evaluator YOUR_MODEL --executor YOUR_MODEL`.
4. The script produces a manifest JSON conforming to the output schema.
5. Compare your results against the Reference Snapshot v1.0 (Tables [1](https://arxiv.org/html/2607.00297#S3.T1)–[3](https://arxiv.org/html/2607.00297#S3.T3)), noting version differences.
### 5.2 Interpretation Guidelines
- $\gamma > 0.5$: Substantial cross-domain coupling. The evaluator’s preferences significantly influence the agent’s strategy distribution across task domains.
- $\gamma < 0.2$: Weak coupling. The agent maintains domain-appropriate strategies.
- Zero-coupling rate $> 50\%$: The evaluator may lack discriminative capacity (floor effect), or the protocol may have converged to a single strategy.
- ECE $> 0.2$: Evaluator preferences are substantially miscalibrated relative to strategy quality.
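The thresholds above can be applied programmatically when screening many conditions. The sketch below is only a convenience wrapper around those four rules; the function name and returned strings are illustrative, and values of $\gamma$ between 0.2 and 0.5 are reported as unlabelled because the guidelines do not assign them a category.

```python
def interpretation_flags(gamma, zero_coupling_rate=None, ece=None):
    """Return the guideline labels that apply to one condition.

    `zero_coupling_rate` is a fraction in [0, 1]; `ece` is optional.
    """
    flags = []
    if gamma > 0.5:
        flags.append("substantial coupling")
    elif gamma < 0.2:
        flags.append("weak coupling")
    else:
        flags.append("unlabelled range (0.2 to 0.5)")
    if zero_coupling_rate is not None and zero_coupling_rate > 0.5:
        flags.append("possible floor effect or single-strategy convergence")
    if ece is not None and ece > 0.2:
        flags.append("substantially miscalibrated evaluator")
    return flags
```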
### 5.3 Known Pitfalls
1. Version decay: All measurements are bound to specific evaluator versions. Re-measure after known model updates.
2. Proxy confounding: Measurements obtained via third-party API gateways may reflect gateway behavior rather than model behavior. Prefer official API endpoints.
3. Format confound: PCI partially reflects output-length preference ($\rho_{\text{agg}} = 0.89$ at $n = 6$, $\rho_{\text{inst}} = 0.219$ at $n = 60$). Interpret PCI as a preference-convergence metric (format + reasoning).
4. Floor effects: Near-zero coupling in self-evaluation may reflect evaluator incapacity (ECE $= 0.31$) rather than genuine stability.
5. Small-$N$ instability: Coupling estimates from $N < 10$ seeds have wide confidence intervals. The reference snapshot reports $N$ explicitly for each condition.
6. Task sensitivity: The protocol uses 8 text + 8 text-proxied visual tasks. Coupling strength may depend on task domain. Report task sets verbatim.
## 6 Relation to Prior Work
### 6.1 Standardized Agent Evaluation Protocols
The ML community is actively building standardized evaluation infrastructure. AgentBeats ([2026](https://arxiv.org/html/2607.00297#bib.bib9)) proposes Agentified Agent Assessment (AAA), using standardized protocols (A2A, MCP) to decouple assessment logic from agent implementation. The Holistic Agent Leaderboard (HAL, [2026](https://arxiv.org/html/2607.00297#bib.bib10)) provides a standardized evaluation harness orchestrating parallel evaluations across hundreds of VMs. A unified agent evaluation framework (Zhu et al., [2026](https://arxiv.org/html/2607.00297#bib.bib5)) converts diverse benchmarks into a standardized instruction–tool–environment format. These efforts focus on agent capability benchmarking: measuring how well agents perform tasks. EPC differs in measuring evaluator preference dynamics: how evaluator biases propagate through the agent’s strategy distribution in closed feedback loops. The two are complementary: capability benchmarks assess agent performance, while EPC assesses evaluator influence.
### 6.2 Versioned and Continuous Evaluation
Static benchmarks decay rapidly. An audit of 26 AI benchmarks found a median longevity score of 5/100 (BenchRisk, [2026](https://arxiv.org/html/2607.00297#bib.bib11)). The community response is continuous, versioned evaluation infrastructure. SWE-rebench ([2025](https://arxiv.org/html/2607.00297#bib.bib12)) continuously mines new GitHub pull requests created after model training cutoffs, directly addressing contamination and saturation. MLCommons AILuminate (MLCommons, [2026](https://arxiv.org/html/2607.00297#bib.bib13)) implements continuous prompt stewardship with per-prompt quality metrics, reserve prompt rotation, and tiered community contributor trust levels. HuggingFace Community Evals (HF Community, [2026](https://arxiv.org/html/2607.00297#bib.bib16)) enables distributed, auditable evaluation through pull-request-based submissions. EPC joins this movement with a versioning convention (vX.Y-Z) that makes measurement decay explicit and a snapshot submission protocol that enables community contributions as evaluator models evolve.
### 6.3 Evaluator Reliability and Drift
Prior work on evaluator reliability spans several complementary approaches. LLM-as-judge studies have documented systematic biases including position bias, verbosity bias, and self-preference amplification Zheng et al. ([2023](https://arxiv.org/html/2607.00297#bib.bib14)); Li et al. ([2024](https://arxiv.org/html/2607.00297#bib.bib15)). Drift detection and attribution methods Li ([2026](https://arxiv.org/html/2607.00297#bib.bib4)) disambiguate whether score changes originate from the system or the judge. IRT-based intrinsic consistency diagnostics measure a single judge’s internal reliability across prompts. Rubric-locking strategies constrain the evaluator’s output space to improve stability. These approaches focus on upstream evaluator calibration, that is, ensuring the evaluator is internally consistent before deployment. EPC addresses the downstream coupling that emerges when an agent adapts its strategy against evaluator feedback in a closed loop. The protocol is designed to be paired with upstream reliability diagnostics: a well-calibrated evaluator can still induce preference coupling through repeated interaction, and EPC provides the standardized measurement to detect this. Alternative closed-loop learning rules (including bandit algorithms, Bradley-Terry preference models, and Bayesian update schemes) may offer different noise-robustness properties; the protocol’s variant tagging system (`EPC-v1.0-AltLR`) is designed to accommodate community exploration of these alternatives while preserving comparability through a core conformance path.
## 7 Limitations
The protocol inherits the limitations of the underlying TTRL methodology: the asymmetric learning rate ($\alpha_{\text{win}} > \alpha_{\text{lose}}$) may inflate coupling magnitude; the fixed baseline strategy (`step_by_step`) introduces a structural bias; and the text-proxied visual tasks limit ecological validity for multimodal settings. The reference snapshot reflects a specific time window (May–June 2026) and is known to be stale for current GPT-4o versions. Community contributions are needed to maintain currency.
## 8 Conclusion
We have specified EPC—a standardized protocol for measuring evaluator preference coupling in LLM agent systems—and provided a versioned reference snapshot v1\.0 as an initial calibration point\. The protocol, implementation, reference data, and community submission template are released as open infrastructure\. We encourage the community to contribute updated snapshots as evaluator models evolve, and to extend the protocol to additional evaluator families, task domains, and coupling metrics\.
## Broader Impact Statement
Standardizing evaluator coupling measurement enables the community to detect when proprietary evaluator behavior changes, potentially preventing the deployment of agents with distorted strategy distributions\. The versioning convention makes measurement decay explicit, reducing the risk of relying on stale baselines\. The protocol does not introduce new capabilities; it standardizes existing measurement methodology\. All reference data are anonymized and obtained via publicly accessible API endpoints\.
## Reproducibility Statement
The complete protocol implementation is released as open\-source Python code \(3\.8\+, no GPU required\)\. The reference snapshot data are provided in machine\-readable JSON\. The protocol specification in §2 is designed to be independently implementable without access to the reference implementation\. All API calls used in the reference snapshot are documented with model identifiers, endpoints, and dates\.
## References
- Liu (2026a). Anonymous. A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics. TMLR submission, 2026.
- Liu (2026b). Anonymous. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems. arXiv:2606.20493, 2026.
- Liu (2026c). Anonymous. Memory Contagion: Cross-Temporal Propagation of Evaluator Bias via Agent Memory. arXiv:2606.23195, 2026.
- Li (2026). Y. Li. Who Drifted: the System or the Judge? arXiv:2606.15474, 2026.
- Zhu et al. (2026). P. Zhu et al. A Unified Framework for the Evaluation of LLM Agentic Capabilities. arXiv:2605.27898, 2026.
- Tang et al. (2026). Z. Tang et al. Stop Comparing LLM Agents Without Disclosing the Harness. Position paper, 2026.
- Pluralistic (2026). Pluralistic Leaderboards. arXiv preprint, 2026.
- Guo et al. (2017). C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. ICML, 2017.
- AgentBeats (2026). AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility. June 2026.
- HAL (2026). Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation. ICLR, 2026.
- BenchRisk (2026). BenchRisk: An Independent Framework for Assessing AI Benchmark Longevity. MLCommons, April 2026.
- SWE-rebench (2025). SWE-rebench: Live, Decontaminated Coding Benchmark. NeurIPS, 2025.
- MLCommons (2026). AILuminate Continuous Prompt Stewardship System. MLCommons, April 2026.
- Zheng et al. (2023). L. Zheng, W.-L. Chiang, Y. Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
- Li et al. (2024). X. Li, T. Zhang, Y. Dubois, et al. AlpacaEval: An Automatic Evaluator of Instruction-following Models. ICLR, 2024.
- HF Community (2026). HuggingFace Community Evals. February 2026.
## Appendix A
### A.1 Strategy Definitions
Table 4: The 11 strategies used in the EPC protocol. All strategies use natural-language prompt prefixes.
### A.2 Reference Task Set
Text tasks: \(1\) Explain why the sky is blue\. \(2\) What are pros and cons of remote work? \(3\) Describe how a computer processes information\. \(4\) Why do leaves change color in autumn? \(5\) Explain the concept of supply and demand\. \(6\) What is the difference between weather and climate? \(7\) How does a vaccine work? \(8\) Explain the water cycle in nature\.
Visual\-adjacent tasks: \(1\) Describe composing a sunset photograph\. \(2\) Explain how to recognize symmetry in architecture\. \(3\) What makes a painting visually balanced? \(4\) Describe how colors interact in a color wheel\. \(5\) How would you explain perspective in drawing? \(6\) What visual elements make a logo memorable? \(7\) Visual difference between a circle and a sphere\. \(8\) How does lighting affect the mood of a photograph?
### A.3 Manifest JSON Schema
The complete JSON schema for EPC measurement manifests is provided in the protocol implementation repository. Key fields: `protocol_version`, `evaluator`, `executor`, `config`, `tasks`, `strategies`, `results` (per-seed $\gamma$, JSD, weights).