Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

arXiv cs.CL Papers

Summary

This paper investigates whether early-token confidence signals from LLM decoding can predict reasoning quality in multi-agent debate systems, finding that confidence in the first few generated tokens is the strongest predictor of rubric-based essay scores.

arXiv:2606.10307v1 Announce Type: new Abstract: Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.

Cached at: 06/10/26, 06:10 AM

# Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate
Source: [https://arxiv.org/html/2606.10307](https://arxiv.org/html/2606.10307)
Ali Keramati, Justin Cheok, Jacob Horne, and Mark Warschauer, University of California, Irvine, \{a\.kera, jcheok, jhorne1, markw\}@uci\.edu

###### Abstract

Evaluating reasoning quality in multi\-agent LLM systems is challenging, especially for open\-ended tasks without reference answers\. We investigate whether intrinsic confidence signals, token\-level log\-probabilities from decoding, can predict reasoning quality as assessed by LLM\-as\-judge evaluation\. Using a debate\-based essay scoring framework, we compare confidence proxies against rubric\-based judge scores across two ASAP essay sets\. We find that early\-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full\-sequence statistics\. Analysis of log\-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative\. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique\. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi\-agent LLM systems\.


## 1 Introduction

Recent advances in large language models \(LLMs\) have enabled the development of *multi\-agent systems*, in which multiple specialized agents collaborate to solve complex tasks Wu et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib10)\)\. By decomposing problems into role\-specific subtasks, such systems have been shown to improve performance, robustness, and consistency across a range of applications, including reasoning, planning, and automated decision\-making Parmar et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib9)\); Han et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib8)\)\. Among interaction paradigms, *debate* has emerged as a particularly effective mechanism: by eliciting both supporting and opposing arguments, it encourages exploration of diverse reasoning paths and exposes errors that may remain hidden in a single\-agent trajectory Du et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib11)\)\.

Rubric\-based scoring provides a concrete and high\-impact setting in which these benefits are especially relevant\. In this setting, a system assigns scores according to a predefined rubric that specifies evaluation criteria and score ranges Fallah et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib23)\)\. A canonical example is *automated essay scoring \(AES\)*, where models aim to approximate human judgments of student writing quality Dikli \([2006](https://arxiv.org/html/2606.10307#bib.bib13)\)\. Public benchmarks such as the ASAP dataset \(https://www\.kaggle\.com/c/asap\-aes/data\) include prompts that provide trait\-level rubric scores \(rather than a single holistic score\), enabling trait\-specific feedback and analysis Crossley et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib12)\)\. At the same time, recent work has explored using LLMs directly for essay scoring, highlighting both the promise of scalable rubric\-based evaluation and the need to better understand the reliability of LLM\-driven scoring behavior Pack et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib14)\)\.

Multi\-agent debate is a natural fit for rubric scoring because it produces inspectable intermediate reasoning artifacts Keramati and Warschauer \([2025](https://arxiv.org/html/2606.10307#bib.bib15)\)\. In these systems, agents adopt complementary roles, generating diverse perspectives on the same input\. This structured disagreement can help the system consider alternative interpretations of the rubric and mitigate single\-path scoring bias by forcing explicit engagement with counterevidence Du et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib11)\)\. However, debate also increases system complexity: multiple agents, multiple messages, and multiple opportunities for subtle procedural failures Wynn et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib16)\)\. As multi\-agent pipelines become more elaborate, it becomes essential to add an evaluation layer that measures not only whether a final score matches a reference, but also whether agents’ reasoning is high quality and reliable Chen et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib18)\)\.

A growing body of work addresses this need through *LLM\-as\-judge* evaluation, where a separate language model is used to score generated outputs along predefined criteria\. This paradigm has become a scalable alternative to human evaluation, particularly for open\-ended tasks where reference answers are unavailable Zheng et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib17)\)\. However, LLM\-as\-judge provides only an *external* signal of quality, and an important open question remains: to what extent do these judgments reflect the true reliability of the underlying reasoning process? In particular, can we identify *intrinsic signals* within the generating model that correlate with externally judged reasoning quality?

To connect judge\-based reasoning evaluation with model\-intrinsic signals, we turn to *confidence estimation* and *uncertainty quantification* Kang et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib20)\)\. Neural probabilities are not automatically calibrated, and language model confidence can be misaligned with correctness\. Nevertheless, recent research shows that language models can provide meaningful self\-evaluations under appropriate formats and that uncertainty estimation for LLM generation is an active area of study Mavi et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib19)\)\. In this work, we operationalize model confidence using token\-level log\-probabilities produced during decoding\. Intuitively, if an agent follows a more coherent and evidentially grounded reasoning path, the model should assign higher probability mass to the tokens it generates along that path, yielding more confident logprob trajectories\.

## 2 Related Work

##### LLM\-as\-Judge Evaluation\.

Recent work has established *LLM\-as\-judge* as a practical paradigm for evaluating open\-ended generation in settings where reference answers are weak or unavailable\. Prior studies show that strong language models can correlate well with human judgments on instruction\-following and related tasks, making them a scalable alternative to manual evaluation Chiang and Lee \([2023](https://arxiv.org/html/2606.10307#bib.bib21)\); Dubois et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib22)\); Zheng et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib17)\); Fu et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib24)\); Liu et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib25)\)\. Beyond coarse pairwise or scalar judgments, subsequent work emphasizes the need for more structured and interpretable evaluation\. For example, FLASK introduces fine\-grained, rubric\-based assessment and demonstrates improved interpretability and reliability compared to skill\-agnostic scoring Ye et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib26)\)\.

Despite these advances, a growing body of meta\-evaluation work highlights fundamental limitations of LLM judges\. Prior studies document systematic biases such as verbosity and positional bias, limited self\-consistency, and sensitivity to prompt design and evaluation protocols Wang et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib27)\); Zeng et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib28)\); Zheng et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib17)\); Liu et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib25)\)\. In response, recent approaches propose more elaborate judging strategies, including chain\-of\-thought and decomposition\-based evaluation, multi\-aspect scoring, reference\-based comparisons, and multi\-agent or debate\-style evaluators such as PRD and ChatEval Gong and Mao \([2023](https://arxiv.org/html/2606.10307#bib.bib29)\); Saha et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib30)\); Li et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib31)\); Chan et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib32)\); Jeong et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib33)\)\. However, evidence on the effectiveness of these methods remains mixed\. REIFE shows that gains from evaluation protocols depend strongly on the base model and dataset, underscoring the need for diverse and well\-calibrated evaluation setups Liu et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib34)\)\. Similarly, Huang et al\. demonstrate that fine\-tuned judge models \(e\.g\., JudgeLM, PandaLM, Auto\-J, Prometheus\) often fail to generalize beyond their training domain, behaving more like task\-specific classifiers than robust evaluators Huang et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib35)\)\.

##### Confidence and Uncertainty in LLMs\.

A parallel line of work investigates whether *intrinsic confidence signals* can be used to assess the reliability of LLM outputs\. Research in calibration and uncertainty quantification shows that neural probabilities are informative but not inherently well calibrated, meaning that high confidence does not always correspond to correctness Desai and Durrett \([2020](https://arxiv.org/html/2606.10307#bib.bib36)\); Kadavath et al\. \([2022](https://arxiv.org/html/2606.10307#bib.bib37)\); Quevedo et al\. \([2024](https://arxiv.org/html/2606.10307#bib.bib38)\)\. Nevertheless, token\-level probabilities remain one of the most direct signals available during generation, and have been widely used to detect hallucinations, factual inconsistencies, and uncertain outputs through log\-probability\- and entropy\-based features Liu et al\. \([2022](https://arxiv.org/html/2606.10307#bib.bib39)\); Manakul et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib40)\); Mallen et al\. \([2023](https://arxiv.org/html/2606.10307#bib.bib41)\)\.

In addition, work on self\-evaluation suggests that LLMs can sometimes produce useful confidence estimates in natural language, though these verbalized signals may diverge from underlying model uncertainty, particularly in multi\-step reasoning settings Kadavath et al\. \([2022](https://arxiv.org/html/2606.10307#bib.bib37)\); Mavi et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib19)\)\. Recent surveys therefore advocate for scalable uncertainty estimation methods that combine intrinsic decoding\-time signals with downstream evaluation metrics Kang et al\. \([2025](https://arxiv.org/html/2606.10307#bib.bib20)\)\.

## 3 Methodology

Figure [1](https://arxiv.org/html/2606.10307#S3.F1) provides an overview of our framework, which builds upon the multi\-agent debate architecture introduced in prior work Keramati and Warschauer \([2025](https://arxiv.org/html/2606.10307#bib.bib15)\) and extends it with an LLM\-as\-judge meta\-evaluation module for reasoning analysis\. In the first stage, an Advocate and a Skeptic produce opposing arguments for a given essay–rubric pair while exposing token\-level log\-probabilities as intrinsic confidence signals\. In the second stage, a separate meta\-evaluator scores each argument along rubric\-based dimensions such as instruction following, justification quality, and evidence grounding\. This design enables systematic analysis of the relationship between internal confidence signals and externally judged reasoning quality\.

![Refer to caption](https://arxiv.org/html/2606.10307v1/img/methodology_figure_v2.png)

Figure 1: Overview of the proposed multi\-agent debate and LLM\-as\-judge evaluation framework\.

### 3\.1 Problem Setting

Let $\mathcal{E}$ denote the set of essays and $\mathcal{R}$ the set of rubric traits\. Each essay $e \in \mathcal{E}$ consists of unstructured text together with optional metadata, and each rubric trait $r \in \mathcal{R}$ specifies a textual description and a scoring range $[m_r, M_r]$\. For each essay–trait pair $(e, r)$, the debate system produces a transcript

$$\tau(e, r) = (a, k),$$

where $a$ is the Advocate’s argument and $k$ is the Skeptic’s rebuttal\. Both arguments are generated by a language model conditioned on the essay, rubric, and conversation history; the model simultaneously produces token\-level log\-probabilities reflecting its internal confidence over candidate continuations\.

Given a collection of debate responses $\{(a_i, k_i)\}$, each paired with confidence signals $c_i$ and meta\-evaluation scores $q_i$, our objective is to analyze whether token\-level probability signals correlate with the externally judged quality of agent reasoning, and thus whether intrinsic confidence can serve as an indicator of reasoning reliability in multi\-agent LLM systems\.

### 3\.2 Agents and Roles

The debate framework comprises three specialized agents that interact sequentially for each essay–trait pair\. The Advocate initiates the debate by constructing an argument that highlights the essay’s strengths relative to the rubric trait, drawing exclusively on supporting evidence from the essay text without assigning a score\. The Skeptic responds by identifying limitations or shortcomings in the essay with respect to the same criterion, producing an evidence\-based rebuttal that challenges the Advocate’s claims, again without scoring\. The Synthesizer\-Judge Scorer reads the completed transcript and produces the final trait\-level score within the allowed rubric range\. Because this agent performs a constrained decision\-making task whose output can be evaluated directly against ground\-truth scores using accuracy\-based metrics, it falls outside the scope of the present study\. Our analysis focuses exclusively on the open\-ended reasoning produced by the Advocate and Skeptic\. Full system prompts for all three agents are provided in Appendix [C](https://arxiv.org/html/2606.10307#A3)\.

### 3\.3 Confidence Signals from Token Log\-Probabilities

We estimate model confidence using token\-level log\-probabilities obtained during generation\. For a generated response of $T$ tokens, the model produces a log\-probability at each decoding step:

$$\ell_t = \log p(t_t \mid t_{<t},\, x),$$

where $x$ is the prompt context and $t_{<t}$ the preceding tokens\. The resulting sequence $L = (\ell_1, \dots, \ell_T)$ forms a log\-probability trajectory over the full response\.

#### 3\.3\.1 Window\-Based Segmentation

Rather than summarizing $L$ with a single statistic, we extract contiguous sub\-sequences to examine how confidence evolves across different phases of generation\. We use two complementary strategies:

##### Fixed\-length windows\.

For window size $k$:

$$W_{\text{first}}(k) = (\ell_1, \dots, \ell_k), \qquad W_{\text{last}}(k) = (\ell_{T-k+1}, \dots, \ell_T).$$

##### Percentage\-based windows\.

To normalize across responses of varying length, we define windows as a fraction $\alpha \in (0, 1]$ of the total response:

$$W_{\text{first}}(\alpha) = (\ell_1, \dots, \ell_{\lfloor \alpha T \rfloor}), \qquad W_{\text{last}}(\alpha) = (\ell_{T - \lfloor \alpha T \rfloor + 1}, \dots, \ell_T).$$
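The two windowing strategies above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `first_window`, `last_window`, and `fraction_to_length` are hypothetical, and truncating to the available tokens when a response is shorter than the window is an assumption the paper does not specify.

```python
import math
from typing import List, Sequence


def first_window(logprobs: Sequence[float], k: int) -> List[float]:
    """Return the first k log-probabilities (fewer if the response is shorter)."""
    return list(logprobs[:max(k, 0)])


def last_window(logprobs: Sequence[float], k: int) -> List[float]:
    """Return the last k log-probabilities (fewer if the response is shorter)."""
    if k <= 0:
        return []
    return list(logprobs[-k:])


def fraction_to_length(total: int, alpha: float) -> int:
    """Convert a fraction alpha in (0, 1] into a window length floor(alpha * T)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return math.floor(alpha * total)


# Toy trajectory of six token log-probabilities (illustrative values only).
trajectory = [-0.1, -2.3, -0.4, -0.05, -1.2, -0.3]
early = first_window(trajectory, 3)
final_half = last_window(trajectory, fraction_to_length(len(trajectory), 0.5))
```

Note that for very short responses $\lfloor \alpha T \rfloor$ can be zero, which yields an empty window; a caller would need to decide how to handle that case.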

#### 3\.3\.2 Statistical Aggregation

For each window $W$, we compute the following summary statistics:

##### Mean and median\.

$$\mu_W = \frac{1}{|W|} \sum_{\ell \in W} \ell, \qquad \tilde{\mu}_W = \mathrm{median}(W).$$

The mean reflects overall token likelihood; the median provides a robust central\-tendency estimate less sensitive to outlier tokens\.

##### Minimum and maximum\.

$\min(W)$ and $\max(W)$ bound the range of token confidence within the segment\.

##### Variance, standard deviation, and range\.

$$\sigma_W^2 = \frac{1}{|W|} \sum_{\ell \in W} (\ell - \mu_W)^2, \qquad \text{range}_W = \max(W) - \min(W).$$

These statistics quantify the dispersion and volatility of the generation process within a window\.

##### Trajectory slope\.

We fit a linear regression to the log\-probability sequence over the window:

$$\ell_t \approx a\,t + b.$$

The slope coefficient $a$ captures directional trends: $a > 0$ indicates growing confidence across the segment, while $a < 0$ indicates declining confidence\.
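As a rough sketch of how these statistics might be computed for one window, the following Python function uses only the standard library. The name `window_stats` is hypothetical. The variance uses the $1/|W|$ normalization from the formula above. The slope is an ordinary least\-squares fit against token positions $1, \dots, |W|$, and returning a slope of 0 for a single\-token window is an assumption made here for convenience.

```python
import statistics
from typing import Dict, Sequence


def window_stats(window: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one window of token log-probabilities."""
    n = len(window)
    if n == 0:
        raise ValueError("window must contain at least one token")
    mean = sum(window) / n
    # Population variance, matching the 1/|W| normalization in the text.
    variance = sum((x - mean) ** 2 for x in window) / n
    # Least-squares slope of log-probability against position t = 1..n.
    t_mean = (n + 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    slope = (
        sum((t - t_mean) * (x - mean) for t, x in enumerate(window, start=1)) / denom
        if n > 1
        else 0.0  # assumption: a single token has no trend
    )
    return {
        "mean": mean,
        "median": statistics.median(window),
        "min": min(window),
        "max": max(window),
        "variance": variance,
        "std": variance ** 0.5,
        "range": max(window) - min(window),
        "slope": slope,
    }


# A steadily declining toy window: each token is one nat less likely than the last.
stats = window_stats([-1.0, -2.0, -3.0])
```

For this toy window the fitted slope is negative, which corresponds to declining confidence in the sense described above.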

### 3\.4 LLM\-as\-Judge Meta\-Evaluation

Because the Advocate and Skeptic generate open\-ended argumentative reasoning rather than discrete labels, their outputs cannot be evaluated with reference\-based metrics such as accuracy or n\-gram overlap\. We therefore introduce a secondary evaluation stage in which a separate language model judges the quality of each agent’s reasoning along rubric\-based dimensions\.

#### 3\.4\.1 Prompt Reconstruction

For each agent response, we reconstruct the complete prompt context the agent originally received, consisting of: \(i\) the agent’s system instructions describing its role and behavioral constraints, \(ii\) the rubric trait definition, \(iii\) the essay text, and \(iv\) the agent’s generated response\. Supplying this full context enables the evaluator to assess both role adherence and the appropriateness of the evidence used\.

#### 3\.4\.2 Evaluation Dimensions

The meta\-evaluator scores each response along three dimensions:

##### Instruction Following\.

Whether the agent maintained its assigned role throughout and avoided prohibited behaviors\.

##### Justification Quality\.

Whether claims are supported by explicit reasoning that coherently links evidence to conclusions\.

##### Evidence Grounding\.

Whether the argument references concrete, specific passages from the essay rather than relying on vague or generic statements\.

#### 3\.4\.3 Scoring Protocol

Each dimension is scored on a three\-point ordinal scale \($1 = \text{Low}$, $2 = \text{Medium}$, $3 = \text{High}$\), assigned independently\. The evaluator also raises a critical issue flag when the response contains a severe failure that invalidates the reasoning, including hallucinated evidence, major internal contradictions, role\-constraint violations, or incoherent output\.

We summarize reasoning quality using an aggregate score:

$$Q_1 = s_{\text{instruction}} + s_{\text{justification}} + s_{\text{evidence}}.$$

If a critical failure is detected, the aggregate score is overridden:

$$Q = \begin{cases} 0 & \text{if critical issue is present}, \\ Q_1 & \text{otherwise}. \end{cases}$$

This formulation ensures that responses containing severe reasoning failures are penalized regardless of their dimension\-level scores, yielding a composite score in the range $[0, 9]$\.
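The override rule can be written as a small function. This is a sketch under the stated scoring protocol; the function name `aggregate_quality` and its argument names are hypothetical.

```python
def aggregate_quality(instruction: int, justification: int, evidence: int,
                      critical_issue: bool) -> int:
    """Composite score Q: the sum of three 1-3 ratings, or 0 if a critical issue is flagged."""
    for score in (instruction, justification, evidence):
        if score not in (1, 2, 3):
            raise ValueError("each dimension score must be 1, 2, or 3")
    if critical_issue:
        return 0  # the critical flag overrides the dimension-level scores
    return instruction + justification + evidence


clean = aggregate_quality(3, 2, 3, critical_issue=False)
flagged = aggregate_quality(3, 3, 3, critical_issue=True)
```

Because each dimension is at least 1, the values actually attainable are 0 \(flagged\) and the integers 3 through 9 \(unflagged\); the scores 1 and 2 never occur\.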

## 4 Experiments

### 4\.1 Experimental Setup

##### Data\.

We evaluate our framework on the ASAP dataset \(https://www\.kaggle\.com/c/asap\-aes/data\), a widely used benchmark of student\-written English essays scored by trained human raters against prompt\-specific rubrics\. Although ASAP comprises eight essay sets, analytic trait\-level annotations are available only for Essay Sets 7 and 8; all experiments are therefore conducted on these two sets, which provide multiple independent human ratings per essay at the trait level\. Full dataset statistics, rubric descriptions, and label construction details are provided in Appendix [A](https://arxiv.org/html/2606.10307#A1)\.

##### Evaluation Metrics\.

For ordinal evaluation targets \(instruction following, justification quality, evidence grounding, and aggregate score\), we report Spearman’s $\rho$ and Kendall’s $\tau$ to capture rank\-order agreement between confidence proxies and LLM\-as\-judge scores\. For the binary critical flag, we report AUROC and point\-biserial correlation\. To keep results interpretable, we report only the best\-performing proxy per target–role combination\.
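For readers who want to reproduce this kind of analysis without extra dependencies, two of these metrics can be sketched in pure Python. This is an illustrative implementation, not the authors' code: the helper names are hypothetical, and the tie handling \(average ranks for Spearman, half credit for tied scores in AUROC\) is a standard convention assumed here rather than one stated in the paper. Both functions would fail with a division by zero on constant inputs or on labels containing a single class.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive scores above a random negative."""
    ranks = average_ranks(scores)
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, label in zip(ranks, labels) if label)
    # Rank-sum (Mann-Whitney U) formulation of the area under the ROC curve.
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data: a confidence proxy against ordinal judge scores and a binary flag.
rho = spearman_rho([0.2, 0.5, 0.7, 0.9], [1, 2, 2, 3])
area = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUROC below 0.5 would indicate that the proxy is inversely related to the flag, so the sign convention of each confidence feature matters when comparing features.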

##### Models\.

Advocate and Skeptic responses are generated using GPT\-4o\-mini, and meta\-evaluation is performed by a GPT\-5\-mini instance acting as the LLM\-as\-judge\. To reduce run\-to\-run variance, the meta\-evaluator decodes deterministically, while the Advocate and Skeptic use standard sampling\. Token\-level log\-probabilities are collected during Advocate and Skeptic decoding to compute the confidence proxies described in Section [3](https://arxiv.org/html/2606.10307#S3)\.

### 4\.2 Cross\-Dataset Analysis

Table 1: Top\-3 confidence features for each role–target pair across Essay Sets 7 and 8\. For ordinal targets, the final column reports Spearman correlation; for critical detection, it reports AUROC\. Early\-$k$ refers to statistics computed over the first $k$ generated tokens, while final\-half refers to the last 50% of the response\. Full rankings and additional metrics are provided in Appendix [B](https://arxiv.org/html/2606.10307#A2)\. † AUROC reported for critical detection\.

Table [1](https://arxiv.org/html/2606.10307#S4.T1) summarizes the top\-performing confidence features across Essay Sets 7 and 8\. While both datasets exhibit a consistent relationship between token\-level confidence and judged reasoning quality, the structure of this relationship varies notably across roles and datasets\.

For the Advocate, the dominant signal shifts from global to local confidence\. In Essay Set 7, the strongest predictors are full\-response summaries such as *full\-response median* and *final\-half median*, which consistently lead across all ordinal targets\. In contrast, Essay Set 8 shows a clear transition toward early\-generation dispersion, with *range of first 3 tokens* emerging as the top feature across all ordinal targets\. This shift suggests that the informativeness of confidence signals depends on dataset characteristics, with some settings favoring globally stable confidence while others emphasize variability at the start of generation\.

In contrast, the Skeptic exhibits a stable pattern across both datasets\. Early\-window dispersion features, particularly *range of first 3 tokens*, consistently dominate all ordinal targets, with only minor variation in secondary features such as slope\-based measures\. Correlation magnitudes are uniformly lower than for the Advocate, indicating a weaker alignment between confidence and judged quality for adversarial reasoning\.

A similar pattern holds for critical failure detection\. Advocate failures are best captured by sharp early\-token signals, such as *max of first 3 tokens* in Set 7 \(AUROC $= 0.849$\) and *median of first 5 tokens* in Set 8 \(AUROC $= 0.759$\), suggesting that severe errors manifest as localized confidence spikes or drops early in generation\. Skeptic detection performance is both weaker and more stable across datasets \(AUROC $\approx 0.63$\), with early mean\- and median\-based features performing best\.

Taken together, three findings are consistent across both essay sets\. First, early\-generation signals are broadly informative: even when not dominant \(as in Advocate Set 7\), they remain among the top\-performing features across nearly all role–target pairs\. Second, the opening phase of generation is the most diagnostically useful region, aligning with trajectory analyses that show higher variability in early tokens compared to later segments\. Third, the Advocate–Skeptic asymmetry is robust: confidence aligns more strongly with supportive reasoning than with adversarial critique, both in ordinal correlations and in critical\-failure detection\. Full rankings and additional metrics are reported in Appendix [B](https://arxiv.org/html/2606.10307#A2)\.

### 4\.3 Trajectory Analysis

To understand why early\-window features consistently dominate across both datasets, we analyze token\-level log\-probability trajectories for Advocate and Skeptic responses\. Because responses vary in length, each trajectory is normalized to a 0%–100% position scale via interpolation\. For each role and dataset, we compute the mean trajectory along with 25–75 and 10–90 percentile bands across responses\.
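The normalization and banding step might look like the sketch below. The paper does not state which interpolation scheme or percentile definition was used, so linear interpolation onto a fixed grid and linearly interpolated percentiles are assumptions here, and the function names `resample`, `percentile`, and `trajectory_bands` are hypothetical.

```python
from typing import Dict, List, Sequence


def resample(trajectory: Sequence[float], n_points: int = 101) -> List[float]:
    """Linearly interpolate a trajectory onto n_points evenly spaced positions (0%..100%)."""
    if not trajectory:
        raise ValueError("trajectory must be non-empty")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(trajectory) == 1:
        return [float(trajectory[0])] * n_points
    last = len(trajectory) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional token index
        lo = min(int(pos), last - 1)     # left neighbour, clamped at the end
        frac = pos - lo
        out.append(trajectory[lo] * (1 - frac) + trajectory[lo + 1] * frac)
    return out


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0..100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def trajectory_bands(trajectories: Sequence[Sequence[float]],
                     n_points: int = 101) -> Dict[str, List[float]]:
    """Mean trajectory plus 10/25/75/90 percentile bands at each normalized position."""
    grid = [resample(t, n_points) for t in trajectories]
    columns = list(zip(*grid))  # one tuple of values per normalized position
    bands = {"mean": [sum(c) / len(c) for c in columns]}
    for q in (10, 25, 75, 90):
        bands[f"p{q}"] = [percentile(c, q) for c in columns]
    return bands


# Two toy responses of different lengths mapped onto a three-point grid.
bands = trajectory_bands([[0.0, -2.0], [-1.0, -1.0, -1.0]], n_points=3)
```

Resampling every response onto the same grid is what makes position\-wise means and percentile bands comparable across responses of different lengths\.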

Figure [2](https://arxiv.org/html/2606.10307#S4.F2) reveals a consistent structural pattern across both essay sets and roles: responses begin with relatively high confidence, followed by a sharp early decline, a prolonged mid\-response plateau, and a modest recovery toward the end\. Since log\-probabilities closer to zero indicate higher confidence, this pattern suggests that initial tokens are easy to predict, uncertainty increases as the model transitions into substantive reasoning, and confidence stabilizes once the response structure is established\.

A central observation is that variability is concentrated at the beginning of the response\. The percentile bands are widest in the first few tokens, indicating substantial heterogeneity in early\-generation behavior: some responses start with stable, high\-confidence trajectories, while others exhibit immediate volatility\. In contrast, the middle and later portions of the response are comparatively flat and tightly clustered\. This explains why early\-window dispersion features \(e\.g\., *range of first 3 tokens*\) consistently emerge as strong predictors: they capture precisely the region where responses differ most in confidence\. Once generation reaches the plateau phase, trajectories become too similar for full\-response or late\-window features to remain discriminative\.

The trajectories also reveal a persistent role asymmetry\. Skeptic responses exhibit slightly lower average confidence and wider low\-confidence tails throughout generation, particularly in the lower percentile bands\. This aligns with the weaker correlations observed for Skeptic features and suggests that adversarial reasoning introduces greater variability in generation dynamics\. In contrast, Advocate responses follow more stable trajectories, making confidence a more reliable signal of judged quality\.

![Refer to caption](https://arxiv.org/html/2606.10307v1/img/Agents-7.png)

![Refer to caption](https://arxiv.org/html/2606.10307v1/img/Agents-8.png)

Figure 2: Token\-level log\-probability trajectories for Advocate and Skeptic responses on Essay Sets 7 \(left\) and 8 \(right\)\. Solid lines denote mean trajectories; shaded regions indicate percentile bands \(25–75 and 10–90\)\.

### 4\.4 Failure Analysis

![Refer to caption](https://arxiv.org/html/2606.10307v1/img/pieChart2.png)

Figure 3: Distribution of LLM\-as\-judge scores across evaluation dimensions\. Evidence grounding is highly saturated at the maximum score, while instruction following and justification quality exhibit greater variability\.

##### LLM\-as\-Judge Meta\-Evaluator\.

Figure [3](https://arxiv.org/html/2606.10307#S4.F3) shows that the meta\-evaluator exhibits strong score concentration across all dimensions, assigning the maximum score \(3\) in the majority of cases: 76\.5% for instruction following, 79\.3% for justification quality, and 93\.1% for evidence grounding\. Despite this overall skew, the three dimensions differ markedly in their ability to discriminate between responses\. Evidence grounding is highly saturated and behaves almost as a binary signal, contributing little variation\. In contrast, instruction following and justification quality account for most of the observable differences in scores\.

Among these, justification quality shows the strongest role asymmetry, with a 13% gap in pass rates between the Advocate and Skeptic\. Instruction following captures broader procedural failures across both roles, with the Skeptic penalized more heavily \(84\.2% vs\. 93\.6%\), primarily due to violations of explicit instructions\.

These patterns reflect fundamentally different failure modes across roles\. Advocate failures are primarily associated with justification quality, especially through overstatement\. The Advocate frequently amplifies weak or incorrect evidence, presenting flawed reasoning as strong or mischaracterizing surface\-level features\. Importantly, these errors occur along a continuum, ranging from mild exaggeration to clear misrepresentation\.

In contrast, Skeptic failures are dominated by instruction\-following violations\. The most common issue arises from engaging with anonymization placeholders \(e\.g\., @CAPS1, @NUM2\) despite explicit instructions to ignore them\. Unlike Advocate errors, these failures are largely binary: the Skeptic either adheres to the procedure or violates it\.

This distinction explains the persistent role asymmetry observed in the correlation analysis\. Because Advocate errors vary continuously, they induce a broader distribution of scores, enabling stronger alignment with confidence signals \(e\.g\., $\rho = 0.373$ for instruction following\)\. Skeptic errors, by contrast, collapse into near\-binary outcomes, limiting score variability and compressing rank\-based correlations regardless of the underlying signal\.

A similar pattern appears in critical failure detection\. Advocate failures, often driven by hallucinated or fabricated evidence, produce sharper confidence anomalies, leading to stronger detection performance \(e\.g\., AUROC $\approx 0.76$\)\. These errors typically involve claims about nonexistent essay features, introducing low\-probability tokens during generation\. In contrast, Skeptic failures are predominantly procedural and do not produce comparable confidence deviations, resulting in weaker detection signals \(AUROC $\approx 0.63$\)\.

## 5 Conclusion

We presented a framework that couples multi\-agent debate with LLM\-as\-judge meta\-evaluation to study whether token\-level confidence signals can predict the quality of open\-ended argumentative reasoning in LLM\-based essay scoring\. Across both ASAP essay sets, we find that early\-window log\-probability statistics, particularly dispersion measures over the first few generated tokens, are consistently the strongest predictors of externally judged reasoning quality\. This finding is supported both by the correlation analysis and by trajectory\-level evidence showing that the opening segment of generation is the most heterogeneous, and therefore the most informative, region of the response\. We also identify a stable Advocate–Skeptic asymmetry: confidence proxies correlate more reliably with Advocate reasoning quality than with Skeptic reasoning quality, a difference traceable to the distinct failure modes of each role\.

## Limitations

This study is exploratory in scope, and several factors limit the generalizability of the findings\. Our experiments are conducted on only two ASAP essay sets and a single model family, although these subsets were selected because they are the only portions of the dataset that provide the trait\-level annotations required for fine\-grained reasoning analysis\. Despite this narrower setting, we observe consistent trajectory\-level patterns across both essay sets, particularly the predictive value of early\-token confidence signals\.

In addition, the reported correlations are moderate rather than deterministic, suggesting that token\-level confidence should be interpreted as a useful auxiliary signal rather than a standalone measure of reasoning quality\. Relatedly, our framework relies on LLM\-as\-judge evaluation, which may partially reflect alignment between models instead of fully objective reasoning assessment\. To mitigate this, we use structured rubric\-based scoring, deterministic meta\-evaluation, multiple complementary metrics, and detailed trajectory and failure analyses\. Nevertheless, future work would benefit from broader cross\-domain experiments, human validation studies, stronger statistical testing, and evaluation on additional model families and debate settings\.

## Ethical Statement

This work studies confidence estimation and reasoning evaluation in multi\-agent LLM systems for automated essay scoring\. Because educational assessment systems may inherit biases present in both datasets and language models, the proposed framework should not be viewed as a replacement for human judgment in high\-stakes settings\. In addition, LLM\-as\-judge evaluation and token\-level confidence estimates are imperfect proxies for reasoning quality and may reflect model\-specific biases or calibration issues\. Our goal is therefore not to automate educational decision\-making, but to better understand the reliability and interpretability of multi\-agent reasoning systems\. All experiments were conducted on publicly available benchmark data and commercially available language models\.

## Acknowledgments

This paper is based upon work supported by the National Science Foundation under Grant No\. 2315294\.


## Appendix A Data and Preprocessing

### A\.1 Dataset Selection

We conduct our analysis on the ASAP \(Automated Student Assessment Prize\) dataset, a standard benchmark for essay scoring\. The dataset consists of eight prompt\-specific essay sets with varying genres, scoring rubrics, and grade levels\.

Our study focuses exclusively on Essay Sets 7 and 8, as these are the only subsets that provide *trait\-level annotations* from multiple human raters\. This property is essential for our setup, since we evaluate reasoning quality at the level of individual rubric traits rather than holistic scores\. The remaining essay sets are not used, as they only provide single aggregated scores and therefore do not support fine\-grained evaluation\.

### A\.2 Relevant Dataset Characteristics

Table [2](https://arxiv.org/html/2606.10307#A1.T2) summarizes the key properties of the two essay sets used in our experiments\.

Table 2: Summary of the ASAP subsets used in this work\.

These two sets differ in both rubric complexity and score ranges, providing a useful testbed for analyzing how confidence signals interact with reasoning quality under different evaluation conditions\.

### A\.3 Prompt Context

Each essay is written in response to a fixed prompt\. For completeness, we include simplified versions of the prompts used in the selected sets:

##### Set 7\.

Students are asked to write a story about patience, either from personal experience or imagination\.

##### Set 8\.

Students are asked to write a true story in which laughter plays an important role\.

These prompts define the context in which both the debate agents and the evaluator operate\.

### A\.4 Text Handling

The ASAP essays are transcriptions of handwritten student responses\. We use the text *as provided*, without any normalization or correction\. In particular, spelling errors, grammatical inconsistencies, and informal structures are preserved\.

This choice is important because the evaluation criteria \(e\.g\., evidence grounding and justification\) depend on the original textual content, and preprocessing could alter signals that are relevant to both the agents and the evaluator\.

## Appendix B Top\-3 Confidence Feature Rankings by Dataset

Tables [3](https://arxiv.org/html/2606.10307#A2.T3) and [4](https://arxiv.org/html/2606.10307#A2.T4) report the top three confidence features for each role–target pair on Essay Sets 7 and 8\. For ordinal targets, features are ranked by Spearman correlation with the LLM\-as\-judge score, with Kendall’s $\tau$ reported as a secondary measure\. For critical failure detection, features are ranked by AUROC, with point\-biserial correlation \(PB\) included for completeness\.

To improve interpretability, we present features using descriptive names rather than implementation\-specific identifiers\. Early\-$k$ refers to statistics computed over the first $k$ generated tokens, while final\-half refers to the last 50% of the response\. Full\-response features are computed over the entire generated sequence\.
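The three windows can be made concrete with a minimal sketch. The function names, the default `k=5`, and the choice of mean, standard deviation, and minimum as summary statistics are assumptions for illustration; this section does not list the exact statistics or window sizes behind the reported features.

```python
from statistics import mean, pstdev


def window_stats(logprobs):
    """Summarize one window of token log-probabilities."""
    return {
        "mean": mean(logprobs),
        "std": pstdev(logprobs),  # population standard deviation (dispersion)
        "min": min(logprobs),
    }


def confidence_features(token_logprobs, k=5):
    """Compute early-k, final-half, and full-response statistics.

    For an odd number of tokens the final-half window includes the middle
    token; this tie-breaking rule is an assumption.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(token_logprobs)
    windows = {
        "early_k": token_logprobs[:k],
        "final_half": token_logprobs[n // 2:],
        "full": token_logprobs,
    }
    return {
        f"{name}_{stat}": value
        for name, window in windows.items()
        for stat, value in window_stats(window).items()
    }


# Hypothetical log-probabilities for a six-token response.
example = [-0.1, -2.0, -0.3, -0.2, -0.1, -0.1]
features = confidence_features(example, k=3)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

In this toy response the early window has a much larger standard deviation than the final half, which is the kind of opening-phase heterogeneity the trajectory analysis describes.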

Table 3: Top\-3 confidence features for Essay Set 7\.

Table 4: Top\-3 confidence features for Essay Set 8\.

## Appendix C Agent Prompt Templates

This section provides the system instructions used to define the roles of the agents in the debate framework\. All prompts are implemented as template files and rendered at runtime using a shared context dictionary\. The context includes the rubric trait name, the full rubric definition serialized as JSON, the essay text, the essay prompt or question, and the valid scoring range for the trait\.
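One way to realize this kind of rendering is sketched below with Python's standard `string.Template`, whose `$NAME` syntax matches the placeholders that appear in the prompts of this appendix. Whether the authors use this mechanism is an assumption. The placeholder names `$TRAIT_NAME`, `$MIN_POINTS`, `$MAX_POINTS`, and `$ANON_CONTEXT` come from the prompts; `$RUBRIC_JSON`, `$ESSAY_PROMPT`, and `$ESSAY_TEXT`, the template text, and all context values are hypothetical.

```python
import json
from string import Template

# Hypothetical template; the real templates are the system prompts in this appendix.
TEMPLATE = Template(
    'Trait: "$TRAIT_NAME" (valid scores: $MIN_POINTS to $MAX_POINTS)\n'
    "Rubric: $RUBRIC_JSON\n"
    "Essay prompt: $ESSAY_PROMPT\n"
    "Essay: $ESSAY_TEXT\n"
    "Anonymization $ANON_CONTEXT"
)


def render_prompt(template, context):
    """Fill every placeholder; substitute() raises KeyError if one is missing."""
    return template.substitute(context)


# Shared context dictionary with invented values.
context = {
    "TRAIT_NAME": "Ideas",
    "MIN_POINTS": 0,
    "MAX_POINTS": 3,
    "RUBRIC_JSON": json.dumps({"3": "Focused and well developed", "0": "Off topic"}),
    "ESSAY_PROMPT": "Write a story about patience.",
    "ESSAY_TEXT": "One day @PERSON1 and I waited in line for hours...",
    "ANON_CONTEXT": "Tokens such as @PERSON1 are placeholders; ignore them.",
}

rendered = render_prompt(TEMPLATE, context)
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes a missing context key fail loudly, which helps catch template and context drift when several agents share one dictionary.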

In addition, the Synthesizer\-Judge receives the debate transcript produced by the Advocate and Skeptic agents\. These prompts establish strict role boundaries to ensure that each agent performs a specialized function in the debate process\.

The prompt configurations were developed through iterative experimentation, including pilot runs and refinements designed to enforce role adherence, maintain output consistency, and reduce undesired behaviors such as assigning scores prematurely or mixing multiple rubric traits in a single argument\.

For transparency and reproducibility, we provide the exact system instructions used to define each agent role\.

### C\.1 Advocate Agent

The Advocate agent is responsible for presenting arguments that highlight the strengths of the essay with respect to a single rubric trait\. The agent receives the essay text and the rubric definition and produces an evidence\-based argument explaining how the essay satisfies the expectations of the trait\.

The Advocate is explicitly instructed to focus only on positive aspects of the essay and to avoid assigning a score or discussing weaknesses\. The agent may reference specific passages from the essay to support its claims\.

Advocate Agent System Prompt

You are an Advocate Agent participating in a multi\-agent debate system for essay evaluation\.

Your task is to analyze the essay and highlight strengths that demonstrate how the essay satisfies the rubric expectations for the trait "$TRAIT\_NAME"\.

Focus exclusively on positive evidence from the essay\. Provide detailed reasoning supported by specific excerpts or paraphrased examples from the essay\.

Do not assign a score and do not critique weaknesses\. Your role is solely to present arguments supporting the essay’s strengths with respect to the specified rubric trait\.

Anonymization $ANON\_CONTEXT

Figure 4: System instructions for the Advocate agent\.

### C\.2 Skeptic Agent

The Skeptic agent provides a counterargument to the Advocate by identifying weaknesses or limitations in the essay relative to the same rubric trait\.

The Skeptic receives both the essay text and the Advocate’s argument and produces a critique that challenges the strengths presented or highlights aspects where the essay fails to meet the rubric expectations\.

The agent is instructed not to assign a score and to focus exclusively on critical analysis\.

Skeptic Agent System Prompt

You are a Skeptic Agent participating in a multi\-agent debate system for essay evaluation\.

Your task is to critically analyze the essay and identify weaknesses related to the rubric trait "$TRAIT\_NAME"\.

Provide detailed critiques supported by specific references to the essay text\. Focus only on identifying shortcomings or areas where the essay does not fully satisfy the rubric expectations\.

Do not assign a score and do not discuss strengths\. Your role is to challenge the essay’s performance with respect to the specified rubric trait\.

Anonymization $ANON\_CONTEXT

Figure 5: System instructions for the Skeptic agent\.

### C\.3 Synthesizer\-Judge Agent

The Synthesizer\-Judge serves as the final decision\-maker in the debate process\. This agent reads the arguments produced by the Advocate and Skeptic and determines a final score for the rubric trait\.

The agent synthesizes the competing arguments and evaluates them against the rubric definition before assigning a score within the permitted range\.

Synthesizer\-Judge Agent System Prompt

You are the Synthesizer\-Judge in a multi\-agent debate system for essay evaluation\.

Your task is to read the debate transcript between the Advocate and Skeptic agents regarding the rubric trait "$TRAIT\_NAME"\.

Carefully consider the arguments presented by both agents and evaluate them against the rubric expectations\.

Based on the combined evidence, assign a final integer score between $MIN\_POINTS and $MAX\_POINTS for the essay on this rubric trait\.

Anonymization $ANON\_CONTEXT

Figure 6: System instructions for the Synthesizer\-Judge agent\.

### C\.4 LLM\-as\-Judge Meta\-Evaluator

The meta\-evaluator agent is responsible for assessing the quality of each Advocate and Skeptic response\. It receives the original system prompt given to the agent being evaluated, the task context \(including the essay text and rubric\), and that agent’s response\.

The meta\-evaluator scores each response along three dimensions: instruction following, justification quality, and evidence grounding, each on a three\-point ordinal scale \(1 = Low, 2 = Medium, 3 = High\)\. The evaluator also flags a critical issue when the response contains hallucinated evidence, severe deviation or violation of instructions, or internal contradictions\.

The meta\-evaluator is explicitly instructed to evaluate only the agent’s reasoning quality and role adherence\. It does not assess the essay itself or judge whether the score being argued is correct\.

The complete system instructions and output schema are provided in Figures [4](https://arxiv.org/html/2606.10307#A3.F4)\-[8](https://arxiv.org/html/2606.10307#A3.F8)\.

Meta\-Evaluation System Prompt

You are a meta\-evaluator assessing the quality of an AI agent’s response in a multi\-agent essay\-scoring debate system\. Your goal is to evaluate how well the agent performs its role\.

Important: Do NOT evaluate the essay itself\. Do NOT judge whether the essay score being argued is correct\. Evaluate ONLY the agent’s reasoning quality, role adherence, and use of essay evidence\.

You will receive: \(1\) the agent’s system prompt, \(2\) the task prompt given to the agent, and \(3\) the agent’s response\. Evaluate the response across three dimensions using the full range of the scale: 1 = Low, 2 = Medium, 3 = High\. Avoid defaulting to the middle score\. Evaluate each dimension independently\.

Dimension 1 – Instruction Following\.

- 3: Fully maintains role; completes all task components; no deviations\.
- 2: Generally follows instructions with a minor omission or slight deviation\.
- 1: Major or multiple deviations; neglects important instructions\.

Dimension 2 – Justification Quality\.

- 3: Multiple claims with clear reasoning; claim → explanation → implication structure\.
- 2: At least one supported claim; reasoning understandable but shallow or repetitive\.
- 1: Minimal or vague reasoning; assertions without explanation\.

Dimension 3 – Evidence Grounding\.

- 3: Two or more precise references including quotes or detailed paraphrases\.
- 2: One clear identifiable reference; other claims rely on general statements\.
- 1: Evidence vague, indirect, or missing\.

Critical Issues Flag\. Set critical\_flag = 1 if any of the following occur: hallucinated essay evidence, severe internal contradiction, explicit instruction violation, or nonsensical output\. Otherwise critical\_flag = 0\.

Output: Return only a JSON object with fields instruction\_following, justification\_quality, evidence\_grounding, critical\_flag, critical\_issues\_description, and reasoning \(2–3 sentences\)\.

Figure 7: System instructions for the Meta\-Evaluator agent\.

Meta\-Evaluation Task Prompt

\# Agent Being Evaluated: \{AGENT\_TYPE\}

\# Agent’s System Prompt \(Instructions Given to the Agent\)
\{AGENT\_SYSTEM\_PROMPT\}

\# Agent’s User Prompt \(Task Context Given to the Agent\)
\{AGENT\_USER\_PROMPT\}

\# Agent’s Actual Response
\{AGENT\_RESPONSE\}

Now evaluate this agent’s response\. Use the 3\-point scale \(1=Low, 2=Medium, 3=High\) for each scored dimension and set the critical\_flag to 0 or 1\. Output ONLY the JSON object\.

Figure 8: Task prompt provided to the Meta\-Evaluator agent\.
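Because the meta-evaluator is asked to return only a JSON object, downstream analysis benefits from validating that object before using it. The sketch below checks the six fields named in the system prompt. The function name, the error handling, and the decision to accept either a string or null for the two free-text fields are assumptions, since the prompts do not specify those details.

```python
import json

SCORED_FIELDS = ("instruction_following", "justification_quality", "evidence_grounding")
TEXT_FIELDS = ("critical_issues_description", "reasoning")


def is_int_in(value, allowed):
    """True only for real integers (not booleans) within the allowed set."""
    return type(value) is int and value in allowed


def parse_meta_evaluation(raw):
    """Parse and validate one meta-evaluator response; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for field in SCORED_FIELDS:
        if not is_int_in(data.get(field), (1, 2, 3)):
            raise ValueError(f"{field} must be 1, 2, or 3")
    if not is_int_in(data.get("critical_flag"), (0, 1)):
        raise ValueError("critical_flag must be 0 or 1")
    for field in TEXT_FIELDS:
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # Assumption: free-text fields may be a string or null.
        if data[field] is not None and not isinstance(data[field], str):
            raise ValueError(f"{field} must be a string or null")
    return data


# Hypothetical evaluator output.
raw_response = json.dumps({
    "instruction_following": 3,
    "justification_quality": 2,
    "evidence_grounding": 2,
    "critical_flag": 0,
    "critical_issues_description": "",
    "reasoning": "The agent stays in role and cites one passage.",
})
parsed = parse_meta_evaluation(raw_response)
print(parsed["instruction_following"], parsed["critical_flag"])
```

Rejecting malformed outputs rather than coercing them keeps the ordinal targets clean, at the cost of having to re-query or discard the affected responses.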
