# DiffScore: Text Evaluation Beyond Autoregressive Likelihood
Source: [https://arxiv.org/html/2605.11601](https://arxiv.org/html/2605.11601)
Wen Lai¹, Yingli Shen²∗, Dingnan Jin¹, Qing Cui¹, Jun Zhou¹, Maosong Sun², Alexander Fraser³
¹Ant Group ²Tsinghua University ³Technical University of Munich
wen.lai@tum.de, syl@mail.tsinghua.edu.cn
###### Abstract
Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias: early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose *masked reconstruction* as an alternative paradigm, in which every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: *multi-timestep quality profiles* that decompose scores across masking rates, and *bidirectional PMI decomposition* that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at [https://github.com/wenlai-lavine/DiffScore](https://github.com/wenlai-lavine/DiffScore).
Figure 1: DiffScore consistently outperforms all baselines across 10 diverse evaluation benchmarks.

## 1 Introduction
Evaluating natural language generation (NLG) remains challenging due to substantial lexical variation among semantically equivalent outputs [clark-etal-2021-thats](https://arxiv.org/html/2605.11601#bib.bib1); [gehrmann2023repairing](https://arxiv.org/html/2605.11601#bib.bib2). Methods have evolved from $n$-gram overlap (BLEU [papineni-etal-2002-bleu](https://arxiv.org/html/2605.11601#bib.bib3), ROUGE [lin-2004-rouge](https://arxiv.org/html/2605.11601#bib.bib4)) and semantic matching (BERTScore [Zhang\*2020BERTScore:](https://arxiv.org/html/2605.11601#bib.bib5), MoverScore [zhao-etal-2019-moverscore](https://arxiv.org/html/2605.11601#bib.bib6)) to generative scoring (BARTScore [yuan2021bartscore](https://arxiv.org/html/2605.11601#bib.bib7), GPTScore [fu-etal-2024-gptscore](https://arxiv.org/html/2605.11601#bib.bib8)) and LLM-as-Judge approaches (G-Eval [liu-etal-2023-g](https://arxiv.org/html/2605.11601#bib.bib9)). Among these, autoregressive (AR) scoring via conditional log-likelihood $\sum_{n}\log p(x^{n}\mid x^{<n},\mathbf{s})$ remains a principled framework. However, AR scoring carries a fundamental limitation: left-to-right factorization conditions each token solely on preceding context, introducing *positional bias* whereby early tokens are evaluated under information-impoverished contexts with lower signal-to-noise ratios (§[6.3](https://arxiv.org/html/2605.11601#S6.SS3)). The Reversal Curse [berglund2024the](https://arxiv.org/html/2605.11601#bib.bib10), whereby AR models fail to infer "B is A" from "A is B," further exposes an intrinsic directional asymmetry, suggesting AR scores reflect architectural bias as much as true text quality.
We propose modeling text evaluation as *masked reconstruction*: measuring a model's ability to recover candidate text under varying degrees of context ablation. High-quality text conforms to linguistic regularities and is thus confidently reconstructable under arbitrary masking patterns. Masked Large Diffusion Language Models (MDLLMs; [sahoo2024simple](https://arxiv.org/html/2605.11601#bib.bib11); [nie2025large](https://arxiv.org/html/2605.11601#bib.bib12); [ye2025dream](https://arxiv.org/html/2605.11601#bib.bib13)) provide an ideal backbone, offering three structural advantages: (1) Bidirectionality eliminates positional bias by conditioning on bilateral context; (2) Multi-granularity signals from diverse masking rates span an evaluation hierarchy from local fluency to global coherence; and (3) Native objective alignment ensures the pretraining objective directly matches the evaluation task.
We instantiate this paradigm as DiffScore, the first masked reconstruction-based NLG evaluation framework, in two variants. DiffScore-Zero performs zero-shot quality estimation via the MDLLM's evidence lower bound (ELBO). DiffScore-FT is fine-tuned on task-relevant corpora without gold-standard references, isolating the performance contribution of the evaluation paradigm itself. For interpretability, we introduce two diagnostic tools: (1) *multi-timestep quality profiles*, which decompose scores across masking rates; and (2) *bidirectional PMI decomposition*, which disentangles fluency and faithfulness into independent dimensions, a capability structurally unattainable in AR frameworks.
Experiments across 10 benchmarks spanning summarization, machine translation, and data-to-text generation validate our approach. As shown in Figure [1](https://arxiv.org/html/2605.11601#S0.F1), DiffScore-FT outperforms BARTScore in most settings, while DiffScore-Zero surpasses GPTScore under zero-shot conditions, highlighting MDLLMs as potent out-of-the-box evaluators. Analyses confirm lower positional bias, stronger directional consistency, and superior interpretability relative to AR baselines.
Our contributions are: (I) We introduce masked reconstruction for NLG evaluation and build DiffScore, the first MDLLM-based evaluation framework, directly addressing the structural flaws of AR scoring. (II) We propose multi-timestep quality profiles and bidirectional PMI decomposition, enabling multi-dimensional quality insights beyond AR frameworks. (III) We demonstrate that DiffScore, built on open-weight MDLLMs, offers a competitive and fully reproducible alternative to proprietary LLM-based evaluators such as G-Eval [liu-etal-2023-g](https://arxiv.org/html/2605.11601#bib.bib9), requiring neither API access nor prompt engineering.
## 2 Preliminaries

We establish notation and review the two foundations of our framework: autoregressive text evaluation (§[2.1](https://arxiv.org/html/2605.11601#S2.SS1)) and masked diffusion language models (§[2.2](https://arxiv.org/html/2605.11601#S2.SS2)).

Notation. Let $\mathcal{V}$ be a finite vocabulary, $\texttt{[M]}\notin\mathcal{V}$ a mask token, and $\bar{\mathcal{V}}\triangleq\mathcal{V}\cup\{\texttt{[M]}\}$. A length-$L$ sequence is denoted $\mathbf{x}=(x^{1},\ldots,x^{L})\in\mathcal{V}^{L}$. For evaluation, we assess a *candidate* text $\mathbf{c}=(c^{1},\ldots,c^{L_{c}})$ given a *source* text $\mathbf{s}=(s^{1},\ldots,s^{L_{s}})$.
### 2.1 Autoregressive Text Evaluation

An autoregressive (AR) language model factorizes sequence log-probability left-to-right:

$$\log p_{\mathrm{AR}}(\mathbf{x})=\sum_{n=1}^{L}\log p_{\mathrm{AR}}(x^{n}\mid x^{<n}). \tag{1}$$

BARTScore [yuan2021bartscore](https://arxiv.org/html/2605.11601#bib.bib7) repurposes this for text evaluation, scoring a candidate $\mathbf{c}$ given a source $\mathbf{s}$ via:

$$\textsc{BARTScore}(\mathbf{c}\mid\mathbf{s})=\sum_{n=1}^{L_{c}}\log p_{\mathrm{AR}}(c^{n}\mid c^{<n},\,\mathbf{s}), \tag{2}$$

where $p_{\mathrm{AR}}$ is a task-finetuned BART-large [lewis-etal-2020-bart](https://arxiv.org/html/2605.11601#bib.bib14). Three variants capture distinct quality dimensions: *marginal* $\textsc{BARTScore}(\mathbf{c})$ for intrinsic fluency; *reverse* $\textsc{BARTScore}(\mathbf{s}\mid\mathbf{c})$ for information coverage; and *bidirectional* (averaging both) for holistic assessment. However, AR factorization introduces two structural flaws [berglund2024the](https://arxiv.org/html/2605.11601#bib.bib10): *positional bias*, as early tokens are evaluated with impoverished context due to unidirectional conditioning; and *directional asymmetry*, where the fixed order conflates architectural artifacts with intrinsic text quality.
### 2.2 Masked Large Diffusion Language Models

Masked large diffusion language models [sahoo2024simple](https://arxiv.org/html/2605.11601#bib.bib11); [nie2025large](https://arxiv.org/html/2605.11601#bib.bib12); [ye2025dream](https://arxiv.org/html/2605.11601#bib.bib13) define a discrete diffusion process over sequences using $\texttt{[M]}$ as an absorbing state.

Forward (masking) process. At continuous time $t\in[0,1]$, the forward process independently corrupts each token of a clean sequence $\mathbf{x}_{0}\in\mathcal{V}^{L}$:

$$q(x_{t}^{i}\mid x_{0}^{i})=(1-t)\,\mathbb{1}[x_{t}^{i}=x_{0}^{i}]+t\,\mathbb{1}[x_{t}^{i}=\texttt{[M]}], \tag{3}$$

replacing each token with $\texttt{[M]}$ with probability $t$. Let $\mathbf{x}_{t}\in\bar{\mathcal{V}}^{L}$ denote the corrupted sequence and $\mathcal{M}_{t}\triangleq\{i:x_{t}^{i}=\texttt{[M]}\}$ the masked positions. The sequence transitions from fully intact ($t=0$) to fully masked ($t=1$).
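To make the forward process concrete, the following is a minimal PyTorch sketch of Eq. (3); `MASK_ID` is a hypothetical placeholder token id, not LLaDA's actual vocabulary entry.

```python
import torch

MASK_ID = 126336  # hypothetical [M] token id; the real id depends on the tokenizer

def forward_mask(x0: torch.Tensor, t: float, generator=None) -> torch.Tensor:
    """Corrupt a clean token sequence x0 by independently replacing each
    token with [M] with probability t (Eq. 3)."""
    corrupt = torch.rand(x0.shape, generator=generator) < t
    xt = x0.clone()
    xt[corrupt] = MASK_ID
    return xt

# Example: at t = 0.3, roughly 30% of positions become [M].
x0 = torch.randint(0, 32000, (1, 16))
xt = forward_mask(x0, t=0.3)
print((xt == MASK_ID).float().mean())  # approx. 0.3 in expectation
```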
Reverse (denoising) process. A neural network $\theta$ predicts clean tokens at masked positions conditioned on the partially masked sequence:

$$p_{\theta}(\hat{\mathbf{x}}_{0}\mid\mathbf{x}_{t})=\prod_{i\in\mathcal{M}_{t}}p_{\theta}(\hat{x}_{0}^{i}\mid\mathbf{x}_{t}). \tag{4}$$

Crucially, $p_{\theta}$ is a bidirectional Transformer *without causal masking*, so each prediction attends to all unmasked tokens regardless of position, unlike AR unidirectional conditioning (Eq. [1](https://arxiv.org/html/2605.11601#S2.E1)).

Evidence lower bound. The log-marginal likelihood admits a variational lower bound [sahoo2024simple](https://arxiv.org/html/2605.11601#bib.bib11):

$$\log p_{\theta}(\mathbf{x}_{0})\;\geq\;\underbrace{\mathbb{E}_{t\sim\mathcal{U}(0,1)}\;\mathbb{E}_{\mathbf{x}_{t}\sim q(\mathbf{x}_{t}\mid\mathbf{x}_{0})}\!\left[\frac{1}{t}\sum_{i\in\mathcal{M}_{t}}\log p_{\theta}(x_{0}^{i}\mid\mathbf{x}_{t})\right]}_{\triangleq\;\mathrm{ELBO}(\mathbf{x}_{0};\,\theta)}. \tag{5}$$

The $1/t$ factor reweights the expected number of masked tokens ($\mathbb{E}[|\mathcal{M}_{t}|]\approx tL$), equalizing per-token contributions across timesteps. The ELBO is estimated by uniformly sampling $K$ timesteps $t_{k}\sim\mathcal{U}(0,1]$ and masking patterns $\mathbf{x}_{t_{k}}\sim q(\mathbf{x}_{t_{k}}\mid\mathbf{x}_{0})$.
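A Monte Carlo sketch of this estimator, reusing `forward_mask` from above; we assume `model(xt)` returns per-position logits over the vocabulary, which matches typical MDLLM interfaces but is not LLaDA's exact API.

```python
import torch.nn.functional as F

def elbo_estimate(model, x0: torch.Tensor, K: int = 20) -> float:
    """Monte Carlo estimate of Eq. (5): sample K (timestep, mask-pattern)
    pairs, score masked tokens with bidirectional context, reweight by 1/t."""
    total = 0.0
    for _ in range(K):
        t = torch.empty(1).uniform_(1e-3, 1.0).item()  # avoid the 1/t blow-up as t -> 0
        xt = forward_mask(x0, t)
        masked = xt == MASK_ID
        if not masked.any():
            continue
        logp = F.log_softmax(model(xt), dim=-1)               # (B, L, |V|)
        tok_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)
        total += (tok_logp[masked].sum() / t).item()
    return total / K
```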
## 3 DiffScore

We conceptualize text evaluation as measuring *reconstruction fidelity*: high-quality text should be robustly recoverable from partial masking. Operationalizing this via the ELBO of MDLLMs endows DiffScore with two structural advantages over AR metrics: (1) Bidirectionality, which eliminates positional bias; and (2) Factorization-invariance, as it marginalizes over random masking patterns rather than relying on a deterministic generation order. A formal analysis is provided in Appendix [A](https://arxiv.org/html/2605.11601#A1).
### 3.1 Scoring Configurations and Estimation

To evaluate a candidate $\mathbf{c}$ (length $L_{c}$) given a source $\mathbf{s}$ (length $L_{s}$), we uniformly sample $K$ masking timesteps $t_{k}\sim\mathcal{U}(0,1]$ and independent masking patterns $\mathbf{y}_{t_{k}}\sim q(\mathbf{y}_{t_{k}}\mid\mathbf{y})$, where $\mathbf{y}$ denotes the text being scored. Let $\mathcal{M}_{t_{k}}$ denote the masked positions. We define a generalized empirical estimator:

$$\mathcal{S}(\mathbf{y}\mid\mathbf{x})=\frac{1}{|\mathbf{y}|}\frac{1}{K}\sum_{k=1}^{K}\omega(t_{k})\sum_{i\in\mathcal{M}_{t_{k}}}\log p_{\theta}(y^{i}\mid\mathbf{x},\mathbf{y}_{t_{k}}), \tag{6}$$

where $\omega(t_{k})$ is a weighting function. We default to the *mean log-probability* (mlp) variant with $\omega(t_{k})=1/|\mathcal{M}_{t_{k}}|$, which yields interpretable per-token scores and avoids the numerical instability of the strict ELBO weighting $\omega(t_{k})=1/t_{k}$ at low masking rates. Since MDLLMs evaluate all masked positions independently, the $K$ forward passes are fully parallelizable.
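A minimal sketch of Eq. (6) under the same assumptions as the ELBO snippet above (a `model` returning per-position logits; `forward_mask` and `MASK_ID` as defined earlier). Conditioning text, when present, is simply prepended and left unmasked.

```python
from typing import Optional

def diff_score(model, y: torch.Tensor, x: Optional[torch.Tensor] = None,
               K: int = 20) -> float:
    """Estimator of Eq. (6) with the mlp weighting omega(t_k) = 1/|M_{t_k}|.
    Only the scored text y is masked; the conditioning text x stays visible."""
    Ly = y.shape[-1]
    total = 0.0
    for _ in range(K):  # independent passes; parallelizable in practice
        t = torch.empty(1).uniform_(1e-3, 1.0).item()
        yt = forward_mask(y, t)
        masked = yt == MASK_ID
        if not masked.any():
            continue
        seq = yt if x is None else torch.cat([x, yt], dim=-1)
        logits = model(seq)[:, -Ly:]                       # scored span only
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)
        total += tok_logp[masked].mean().item()            # 1/|M_t| weighting
    return total / (K * Ly)                                # 1/K and 1/|y| prefactors of Eq. (6)
```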
Mirroring directional variants in prior work [yuan2021bartscore](https://arxiv.org/html/2605.11601#bib.bib7), we instantiate Eq. [6](https://arxiv.org/html/2605.11601#S3.E6) into four configurations:
#### Marginal score (Fluency).

Evaluates intrinsic quality without source context:

$$\textsc{DiffScore}_{\mathrm{mar}}(\mathbf{c})=\mathcal{S}(\mathbf{c}\mid\emptyset). \tag{7}$$
#### Conditional score (Faithfulness).

Evaluates candidate quality conditioned on the fully visible source $\mathbf{s}$:

$$\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{s})=\mathcal{S}(\mathbf{c}\mid\mathbf{s}). \tag{8}$$
#### Reverse score (Coverage).

Evaluates information coverage by masking the source while leaving the candidate visible:

$$\textsc{DiffScore}_{\mathrm{rev}}(\mathbf{s}\mid\mathbf{c})=\mathcal{S}(\mathbf{s}\mid\mathbf{c}). \tag{9}$$
#### Bidirectional score (Holistic).

Combines faithfulness and coverage by evaluating both directions of the source-candidate relationship:

$$\textsc{DiffScore}_{\mathrm{bi}}(\mathbf{c},\mathbf{s})=\alpha\cdot\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{s})+(1-\alpha)\cdot\textsc{DiffScore}_{\mathrm{rev}}(\mathbf{s}\mid\mathbf{c}), \tag{10}$$

where $\alpha=0.5$ by default. Unlike AR frameworks that suffer from directional asymmetry (the Reversal Curse [berglund2024the](https://arxiv.org/html/2605.11601#bib.bib10)), the MDLLM provides a structurally consistent substrate for both directions.
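In code, the four configurations reduce to thin wrappers around the `diff_score` sketch above (Eqs. 7-10):

```python
def diffscore_mar(model, c):                 # Eq. (7): fluency
    return diff_score(model, c, x=None)

def diffscore_cond(model, c, s):             # Eq. (8): faithfulness
    return diff_score(model, c, x=s)

def diffscore_rev(model, s, c):              # Eq. (9): coverage
    return diff_score(model, s, x=c)

def diffscore_bi(model, c, s, alpha=0.5):    # Eq. (10): holistic
    return (alpha * diffscore_cond(model, c, s)
            + (1 - alpha) * diffscore_rev(model, s, c))
```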
### 3.2 Diagnostic Dimensions: Quality Profiles and PMI Decomposition

Beyond scalar scores, DiffScore offers diagnostic tools structurally unattainable in AR frameworks.

#### Multi-Timestep Quality Profiles.

The masking rate $t$ serves as a continuous dial over evaluation granularity. By discretizing $(0,1]$ into $T$ levels, we extract timestep-specific scores $\{S(t_{k})\}_{k=1}^{T}$. At low $t$ (context-rich), the profile captures *local fluency*; at high $t$ (context-sparse), it captures *semantic coherence*. The aggregate score can be viewed as a weighted profile: $\textsc{DiffScore}_{\mathrm{profile}}=\sum_{k}w_{k}\cdot S(t_{k})$.
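A sketch of profile extraction, again reusing the earlier helpers; the fixed-$t$ scorer and the grid spacing here are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def score_at_t(model, y, t, x=None):
    """One fixed-timestep evaluation: mask y at rate t and return the mean
    log-probability of the masked tokens (a single term of Eq. 6)."""
    yt = forward_mask(y, t)
    masked = yt == MASK_ID
    if not masked.any():
        return float("nan")
    seq = yt if x is None else torch.cat([x, yt], dim=-1)
    logp = F.log_softmax(model(seq)[:, -y.shape[-1]:], dim=-1)
    return logp.gather(-1, y.unsqueeze(-1)).squeeze(-1)[masked].mean().item()

def quality_profile(model, y, x=None, T=10):
    """Timestep-specific scores {S(t_k)} on a grid of T masking rates."""
    grid = np.linspace(1.0 / T, 1.0, T)   # T uniformly spaced levels in (0, 1]
    return [score_at_t(model, y, float(t), x) for t in grid]

def weighted_profile(profile, weights):
    """Aggregate with per-timestep weights w_k, e.g., learned weights that
    concentrate at high t for coherence (Section 6.1)."""
    return sum(w * s for w, s in zip(weights, profile))
```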
#### Bidirectional PMI Decomposition.

To disentangle intrinsic fluency from source-dependent faithfulness, we isolate the information gain via Pointwise Mutual Information (PMI):

$$\textsc{DiffScore}_{\mathrm{PMI}}(\mathbf{c},\mathbf{s})=\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{s})-\textsc{DiffScore}_{\mathrm{mar}}(\mathbf{c}). \tag{11}$$

While AR models can compute PMI, the resultant score inherits unidirectional positional biases from both terms. DiffScore's bidirectional context ensures a strictly orthogonal separation of fluency and relevance.
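In code, the decomposition is a single subtraction over the wrappers from §3.1:

```python
def diffscore_pmi(model, c, s):
    """Eq. (11): information gained by conditioning on the source.
    High marginal + low PMI  -> fluent but irrelevant candidate;
    low marginal + high PMI  -> disfluent but relevant candidate
    (cf. the adversarial test in Section 6.2)."""
    return diffscore_cond(model, c, s) - diffscore_mar(model, c)
```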
### 3.3 Domain Adaptation: Zero-Shot vs. Fine-Tuned

Our framework operates in two modes.

#### DiffScore-Zero.

Uses an off-the-shelf, instruction-tuned MDLLM (e.g., LLaDA [nie2025large](https://arxiv.org/html/2605.11601#bib.bib12)) for zero-shot, unsupervised quality estimation.
#### DiffScore-FT.

Fine-tunes the MDLLM on task-relevant corpora *without* gold-standard references; the model learns domain-specific reconstruction patterns from source-candidate pairs alone. Following BARTScore's strategy, we use CNN/DailyMail as the fine-tuning corpus and optimize the masked reconstruction loss exclusively on the candidate portion with the source fully visible (Appendix [D.3](https://arxiv.org/html/2605.11601#A4.SS3)):

$$\mathcal{L}_{\mathrm{FT}}(\theta)=\mathbb{E}_{(\mathbf{s},\mathbf{c}),\,t,\,\mathbf{c}_{t}}\left[\frac{1}{|\mathcal{M}_{t}|}\sum_{i\in\mathcal{M}_{t}}-\log p_{\theta}(c^{i}\mid\mathbf{s},\mathbf{c}_{t})\right]. \tag{12}$$
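A sketch of one training step for Eq. (12), under the same interface assumptions as before; in practice the loss would back-propagate through a LoRA-adapted MDLLM (Appendix D.1).

```python
def ft_loss(model, s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Masked reconstruction loss of Eq. (12): mask only the candidate c,
    keep the source s fully visible, average NLL over masked positions.
    Assumes t is sampled high enough that at least one position is masked."""
    t = torch.empty(1).uniform_(1e-3, 1.0).item()
    ct = forward_mask(c, t)
    masked = ct == MASK_ID
    seq = torch.cat([s, ct], dim=-1)
    logits = model(seq)[:, -c.shape[-1]:]        # loss on the candidate span only
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, c.unsqueeze(-1)).squeeze(-1)
    return -tok_logp[masked].mean()              # (1/|M_t|) * sum of -log p
```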
## 4 Experiments

### 4.1 Datasets and Tasks

We meta-evaluate DiffScore across 10 benchmarks spanning three NLG tasks, strictly mirroring the experimental setup of BARTScore [yuan2021bartscore](https://arxiv.org/html/2605.11601#bib.bib7) to isolate the impact of the generative paradigm (autoregressive vs. masked diffusion). Although individual benchmarks provide fine-grained quality annotations (e.g., coherence, faithfulness, fluency), our main evaluation aggregates these into an overall comparison (Figure [1](https://arxiv.org/html/2605.11601#S0.F1)), with per-dimension results reported separately. Further dataset details are in Appendix [B](https://arxiv.org/html/2605.11601#A2).
Summarization (SUM). 6 benchmarks: SummEval [fabbri2021summeval](https://arxiv.org/html/2605.11601#bib.bib15), Newsroom [grusky-etal-2018-newsroom](https://arxiv.org/html/2605.11601#bib.bib16), REALSumm [bhandari-etal-2020-evaluating](https://arxiv.org/html/2605.11601#bib.bib17), Rank19 [falke-etal-2019-ranking](https://arxiv.org/html/2605.11601#bib.bib18), QAGS-CNN [wang-etal-2020-asking](https://arxiv.org/html/2605.11601#bib.bib19), and QAGS-XSUM [wang-etal-2020-asking](https://arxiv.org/html/2605.11601#bib.bib19), with annotations for coverage (COV), fluency (FLU), coherence (COH), relevance (REL), and faithfulness (FAI).

Machine Translation (MT). 7 language pairs from WMT19 [ma-etal-2019-results](https://arxiv.org/html/2605.11601#bib.bib20) (de-en, fi-en, gu-en, kk-en, lt-en, ru-en, zh-en) with segment-level direct assessment (DA) scores. In MT, the "source" for scoring purposes is the reference translation (§[3.1](https://arxiv.org/html/2605.11601#S3.SS1)), while the original source-language text is the input being translated.

Data-to-Text (D2T). 3 datasets: BAGEL [mairesse-etal-2010-phrase](https://arxiv.org/html/2605.11601#bib.bib21), SFHOT [wen-etal-2015-semantically](https://arxiv.org/html/2605.11601#bib.bib22), and SFRES [wen-etal-2015-semantically](https://arxiv.org/html/2605.11601#bib.bib22), evaluating informativeness (INF) and naturalness (NAT) of texts generated from structured meaning representations.
### 4.2 Baselines

We compare DiffScore against a comprehensive suite of baselines, categorized into five paradigms:

(1) Lexical Overlap Metrics: BLEU [papineni-etal-2002-bleu](https://arxiv.org/html/2605.11601#bib.bib3), ROUGE [lin-2004-rouge](https://arxiv.org/html/2605.11601#bib.bib4), and METEOR [banerjee-lavie-2005-meteor](https://arxiv.org/html/2605.11601#bib.bib23), which rely on surface-level $n$-gram matching. (2) Embedding-based Metrics: BERTScore [Zhang\*2020BERTScore:](https://arxiv.org/html/2605.11601#bib.bib5) and MoverScore [zhao-etal-2019-moverscore](https://arxiv.org/html/2605.11601#bib.bib6), which compute soft alignments using pre-trained contextualized embeddings. (3) Autoregressive Probability Metrics: BARTScore [yuan2021bartscore](https://arxiv.org/html/2605.11601#bib.bib7), a state-of-the-art AR metric for conditional generation evaluation. We also include GPTScore (GPT-2-large) [fu-etal-2024-gptscore](https://arxiv.org/html/2605.11601#bib.bib8) as the zero-shot AR baseline. (4) Supervised Multi-dimensional Evaluators: UniEval [zhong-etal-2022-towards](https://arxiv.org/html/2605.11601#bib.bib24), AlignScore [zha-etal-2023-alignscore](https://arxiv.org/html/2605.11601#bib.bib25), and QuestEval [scialom-etal-2021-questeval](https://arxiv.org/html/2605.11601#bib.bib26), which heavily rely on auxiliary tasks (e.g., NLI or Boolean QA) and complex multi-stage data curation. (5) LLM-as-a-Judge: G-Eval [liu-etal-2023-g](https://arxiv.org/html/2605.11601#bib.bib9) utilizing GPT-4 via API, representative of modern prompting-based evaluation paradigms. Extended descriptions of all baselines are provided in Appendix [C](https://arxiv.org/html/2605.11601#A3).
### 4.3 Implementation

Model Configuration. We instantiate DiffScore based on LLaDA [nie2025large](https://arxiv.org/html/2605.11601#bib.bib12). DiffScore-Zero uses LLaDA-8B-Instruct with a task-specific prompt; DiffScore-FT fine-tunes LLaDA-8B-Base (no prompt). Training details are in Appendix [D](https://arxiv.org/html/2605.11601#A4).

DiffScore Configuration Selection. We align the scoring formulation (§[3.1](https://arxiv.org/html/2605.11601#S3.SS1)) with the specific quality dimension being evaluated (see Appendix [E](https://arxiv.org/html/2605.11601#A5) for detailed rationale; a dispatch-table sketch follows the list below):
- Machine Translation (Adequacy): We use the conditional score $\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{r})$, treating the reference translation as a fully visible source to measure how faithfully the candidate preserves the reference semantics.
- Summarization: The configuration varies by target dimension: (i) $\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{s})$ probes *faithfulness* and *consistency* by reconstructing the candidate from the source document; (ii) the reverse score $\textsc{DiffScore}_{\mathrm{rev}}(\mathbf{s}\mid\mathbf{c})$ evaluates *coverage* by recovering source content from the candidate; (iii) the marginal score $\textsc{DiffScore}_{\mathrm{mar}}(\mathbf{c})$ assesses source-independent *fluency*; and (iv) the bidirectional score $\textsc{DiffScore}_{\mathrm{bi}}(\mathbf{c},\mathbf{s})$ evaluates *coherence*, *relevance*, and *informativeness*, capturing both source fidelity and intrinsic text quality.
- Data-to-Text (All dimensions): We adopt the bidirectional score $\textsc{DiffScore}_{\mathrm{bi}}(\mathbf{c},\mathbf{s})$ for informativeness, naturalness, and overall quality, as D2T jointly demands structural fidelity to data records and linguistic naturalness.
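One way to encode this selection is a small dispatch table over the wrappers from §3.1; the task and dimension keys below are illustrative names, not identifiers from the released code.

```python
# Hypothetical (task, dimension) -> scoring-configuration mapping.
CONFIG_BY_DIMENSION = {
    ("mt",  "adequacy"):        diffscore_cond,   # candidate | reference
    ("sum", "faithfulness"):    diffscore_cond,   # candidate | source
    ("sum", "consistency"):     diffscore_cond,
    ("sum", "coverage"):        diffscore_rev,    # source | candidate
    ("sum", "fluency"):         diffscore_mar,    # candidate alone
    ("sum", "coherence"):       diffscore_bi,
    ("sum", "relevance"):       diffscore_bi,
    ("d2t", "informativeness"): diffscore_bi,
    ("d2t", "naturalness"):     diffscore_bi,
    ("d2t", "quality"):         diffscore_bi,
}
```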
### 4.4 Evaluation Metrics

We meta-evaluate each metric by measuring its correlation with human judgments. We report:

- Segment-level: Kendall's $\tau$ and Spearman's $\rho$, measuring alignment with human rankings of individual samples.
- System-level: Pearson's $r$ and Spearman's $\rho$, assessing the metric's ability to rank NLG systems.

To ensure statistical rigor, we employ the Williams test [williams1959test](https://arxiv.org/html/2605.11601#bib.bib27) to ascertain the significance of performance differences between DiffScore and the baselines. Furthermore, we report 95% bootstrap confidence intervals (resampled 1,000 times) for all correlation scores to guarantee the stability and reliability of our findings.
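For reference, a minimal SciPy sketch of this meta-evaluation protocol (segment-level correlations plus a percentile bootstrap); it is a generic implementation, not the paper's released evaluation script.

```python
import numpy as np
from scipy import stats

def meta_evaluate(metric_scores, human_scores, n_boot=1000, seed=0):
    """Correlations with human judgments plus a 95% bootstrap CI."""
    m = np.asarray(metric_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    tau = stats.kendalltau(m, h)[0]
    rho = stats.spearmanr(m, h)[0]
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(m), len(m))   # resample with replacement
        boots.append(stats.spearmanr(m[idx], h[idx])[0])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return {"kendall_tau": tau, "spearman_rho": rho, "rho_95ci": (lo, hi)}
```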
## 5 Results

Figure [1](https://arxiv.org/html/2605.11601#S0.F1) summarizes the overall performance across 10 benchmarks. We detail findings across three generation tasks.

Table 1: Segment-level Kendall correlation ($\tau$) on the WMT19 metrics shared task for seven x→en language pairs. Bold: best; underline: second best. †/‡: significantly better than BARTScore/GPTScore ($p<0.05$, Williams test).

Machine Translation. Table [1](https://arxiv.org/html/2605.11601#S5.T1) reports segment-level Kendall $\tau$ on WMT19. DiffScore-FT outperforms BARTScore on six of seven pairs ($\tau_{\mathrm{avg}}=0.356$ vs. $0.342$), with the largest gains on zh-en and gu-en (both $+0.030$), where word-order divergence makes bidirectional reasoning particularly effective. Zero-shot DiffScore surpasses GPTScore on five pairs ($\tau_{\mathrm{avg}}$: $0.345$ vs. $0.312$), confirming pre-trained MDLLMs as robust out-of-the-box evaluators. While G-Eval achieves high scores on some pairs, it suffers from cross-lingual instability (e.g., $\tau=0.120$ on fi-en vs. $0.520$ on ru-en); DiffScore delivers consistent evaluations without proprietary APIs. Supervised evaluators (UniEval, AlignScore, QuestEval) generally underperform, suggesting NLI/QA training transfers poorly to translation.
Table 2: Meta-evaluation on six summarization benchmarks. COV: coverage; COH: coherence; FAC: faithfulness; FLU: fluency; INFO: informativeness; REL: relevance; ACC: pairwise accuracy. We report Spearman $\rho$ for SummEval and Newsroom (segment-level), Pearson $r$ for REALSumm coverage (system-level), pairwise accuracy for Rank19, and Pearson $r$ for QAGS.

Text Summarization. Table [2](https://arxiv.org/html/2605.11601#S5.T2) reports correlations across 6 sub-benchmarks. Different datasets assess different quality dimensions: REALSumm measures coverage, SummEval provides four dimensions, and QAGS focuses on factual consistency. DiffScore-FT outperforms BARTScore on 10 of 12 sub-benchmarks ($\rho_{\mathrm{avg}}$: $0.544$ vs. $0.519$), while zero-shot DiffScore exceeds GPTScore by $+0.051$. The largest gains emerge on *faithfulness-oriented* metrics: SummEval consistency ($+0.102$) and both QAGS datasets. Unlike unidirectional models that only verify token plausibility from preceding context, bidirectional models enforce global coherence, yielding more reliable factual assessments. Zero-shot DiffScore tops all baselines on REALSumm coverage, indicating strong intrinsic capabilities for assessing information completeness. While supervised evaluators excel on dimensions matched to their training data, DiffScore-FT generalizes more robustly across diverse quality dimensions.
Table 3: System-level Spearman correlation ($\rho$) on three data-to-text benchmarks. INF: informativeness; NAT: naturalness; QUA: overall quality.

Data-to-Text Generation. Table [3](https://arxiv.org/html/2605.11601#S5.T3) reports system-level Spearman $\rho$ on three benchmarks (INF: informativeness; NAT: naturalness; QUA: overall quality). DiffScore-FT outperforms BARTScore across all nine sub-benchmarks ($\rho_{\mathrm{avg}}$: $0.310$ vs. $0.240$). Since DiffScore-FT is fine-tuned solely on summarization data, this $+0.070$ improvement demonstrates strong out-of-domain generalization. The gain is most prominent on SFRES informativeness ($+0.153$), where bidirectional reasoning effectively captures complex attribute-value mappings. Zero-shot DiffScore matches G-Eval's average performance, offering a competitive open-weight alternative to proprietary evaluators.
## 6 Analysis

We investigate the structural mechanisms driving DiffScore's performance: multi-timestep quality decomposition (§[6.1](https://arxiv.org/html/2605.11601#S6.SS1)), PMI-based fluency/relevance separation (§[6.2](https://arxiv.org/html/2605.11601#S6.SS2)), positional and directional fairness (§[6.3](https://arxiv.org/html/2605.11601#S6.SS3)), generalization capability (§[6.4](https://arxiv.org/html/2605.11601#S6.SS4)), and sensitivity to design choices (§[6.5](https://arxiv.org/html/2605.11601#S6.SS5)).
### 6.1 Multi-Timestep Quality Profiles

Figure 2: Spearman $\rho$ between timestep-specific DiffScore and human judgments on SummEval.

A central hypothesis of DiffScore is that the masking rate $t$ serves as a continuous dial for evaluation granularity. Figure [2](https://arxiv.org/html/2605.11601#S6.F2) validates this on SummEval. Fluency correlations peak at low masking rates ($t\leq 0.5$), indicating that local syntactic well-formedness relies on dense context. Consistency peaks at $t=0.6$, where moderate context ablation optimally probes source alignment. Coherence and relevance peak at high masking rates ($t=0.9$), showing that sparse context forces the model to capture global semantic cues.

We verify this temporal structure via 5-fold cross-validated weight optimization. Learned weights concentrate at $t=0.9$ for coherence (72.6%) and relevance (40.0%), whereas fluency relies on early timesteps ($t\leq 0.4$). This confirms that the multi-timestep structure provides a principled quality decomposition structurally absent in AR frameworks (Appendix [J](https://arxiv.org/html/2605.11601#A10)).
### 6.2 Bidirectional PMI Decomposition

Table 4: PMI decomposition on adversarial SummEval variants. Cond: conditional score; Mar: marginal (fluency) score; PMI: relevance signal ($=\text{Cond}-\text{Mar}$). All pairwise differences are significant at $p<10^{-5}$ (Mann–Whitney $U$) unless marked with †.

We validate the PMI decomposition (§[3.2](https://arxiv.org/html/2605.11601#S3.SS2)) via adversarial testing on SummEval (Table [4](https://arxiv.org/html/2605.11601#S6.T4)). We construct two perturbation conditions: fluent but irrelevant summaries, and disfluent but relevant ones. If PMI correctly isolates relevance from fluency, it should collapse under the first condition and remain high under the second.

Results confirm this separation. For fluent but irrelevant candidates, the marginal score remains high ($p=0.32$), yet PMI collapses from $+1.88$ to $+0.11$, showing the source provides no predictive advantage. For disfluent but relevant candidates, the marginal score drops sharply while PMI remains robust ($+1.77$), preserving the relevance signal. DiffScore achieves tighter score distributions than BARTScore, enhancing the statistical reliability of this decomposition.
### 6.3 Positional Fairness and Directional Consistency

Figure 3: Left: per-position score distributions on SummEval. Right: directional consistency on 200 synthetic reversal pairs.

AR factorization systematically under-conditions early tokens and is vulnerable to the Reversal Curse [berglund2024the](https://arxiv.org/html/2605.11601#bib.bib10). DiffScore mitigates these issues via bidirectional context. Figure [3](https://arxiv.org/html/2605.11601#S6.F3) (left) shows per-position token-level score distributions on SummEval. DiffScore exhibits a mean positional standard deviation of 2.31, compared to 5.61 for BARTScore, a $2.4\times$ reduction in positional bias.
We further evaluate directional consistency on 200 synthetic forward-reverse pairs (Figure [3](https://arxiv.org/html/2605.11601#S6.F3), right). DiffScore achieves a mean consistency of 0.885 with rank correlation 0.471, a 76% improvement over BARTScore. This confirms that marginalizing over random masking patterns creates an intrinsically symmetric evaluation substrate (Appendices [L](https://arxiv.org/html/2605.11601#A12), [M](https://arxiv.org/html/2605.11601#A13)).
### 6.4 Generalization Ability

To determine whether advantages stem from the masked reconstruction paradigm or LLaDA's architecture, we instantiate our framework on Dream [ye2025dream](https://arxiv.org/html/2605.11601#bib.bib13), a 7B-parameter MDLLM trained from scratch on 200B tokens. Zero-shot, Dream yields near-random correlations, suggesting that an MDLLM requires substantial pre-training scale and instruction tuning to internalize the linguistic priors needed for evaluation. Fine-tuning, however, fully unlocks Dream's performance, achieving results comparable to BARTScore on MT and summarization benchmarks. This indicates that the masked reconstruction objective provides a universally effective inductive bias for text evaluation, even with weak initial representations. The LLaDA-based DiffScore consistently outperforms its Dream counterpart, confirming that stronger pre-trained representations directly improve evaluation capabilities (Appendix [N](https://arxiv.org/html/2605.11601#A14)).
### 6.5 Ablation Study

Table 5: Ablation study on SummEval (Spearman $\rho$, zero-shot DiffScore). Default configuration (†): $K=20$, $T=10$, random masking.

Table [5](https://arxiv.org/html/2605.11601#S6.T5) examines sensitivity to key design choices using zero-shot DiffScore on SummEval. In each block, bold indicates the best result. Performance saturates at $K=20$ masking patterns. A timestep resolution of $T=10$ yields the best average correlation; finer discretization introduces estimation noise. Uniform random masking substantially outperforms structured alternatives (e.g., entity-only masking), confirming that comprehensive token-type coverage is essential. DiffScore is robust to prompt formulation, with average correlations fluctuating by only 0.010 across seven templates (Appendix [I](https://arxiv.org/html/2605.11601#A9), Table [11](https://arxiv.org/html/2605.11601#A9.T11)).

Regarding computational cost, Monte Carlo estimation requires $K$ forward passes. However, a lightweight variant ($K=5$) retains over 95% of the standard correlation at $3.7\times$ lower latency. Future work leveraging amortized inference could further reduce this cost (Appendix [H](https://arxiv.org/html/2605.11601#A8)).

Hyperparameter selection. All hyperparameters ($K=20$, $T=10$, random masking, and mlp weighting, i.e., the mean log-probability variant of §3.1) were selected based on the convergence analysis in Appendix [F](https://arxiv.org/html/2605.11601#A6) and the ablation above, without tuning on evaluation benchmarks. The fine-tuning configuration (Table [7](https://arxiv.org/html/2605.11601#A4.T7)) strictly mirrors BARTScore's setup (same training data, epochs, and optimization), differing only in model architecture and reconstruction objective.
## 7 Conclusion

We introduce DiffScore, a text evaluation framework that reconceptualizes quality assessment as masked reconstruction. Leveraging the ELBO of MDLLMs, DiffScore eliminates the positional bias inherent in autoregressive scoring while enabling diagnostic tools (multi-timestep quality profiles and bidirectional PMI decomposition) structurally unattainable in AR frameworks. Experiments across 10 benchmarks confirm that DiffScore-FT consistently outperforms BARTScore under controlled comparison, while the zero-shot variant surpasses GPTScore without task-specific adaptation. Cross-architecture experiments validate that these advantages stem from the masked reconstruction paradigm itself rather than a specific model. Built on open-weight MDLLMs, DiffScore provides a competitive and fully reproducible alternative to proprietary LLM-based evaluators. Limitations and future directions are discussed in Appendix [S](https://arxiv.org/html/2605.11601#A19).
## References
- [1] Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that's 'human' is not gold: Evaluating human evaluation of generated text. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online, August 2021. Association for Computational Linguistics.
- [2] Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. Journal of Artificial Intelligence Research, 77:103–166, 2023.
- [3] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
- [4] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
- [5] Tianyi Zhang\*, Varsha Kishore\*, Felix Wu\*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
- [6] Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China, November 2019. Association for Computational Linguistics.
- [7] Weizhe Yuan, Graham Neubig, and Pengfei Liu. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
- [8] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico, June 2024. Association for Computational Linguistics.
- [9] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational Linguistics.
- [10] Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". In The Twelfth International Conference on Learning Representations, 2024.
- [11] Subham S. Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [12] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [13] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.
- [14] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics.
- [15] Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021.
- [16] Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
- [17] Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. Re-evaluating evaluation in text summarization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online, November 2020. Association for Computational Linguistics.
- [18] Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy, July 2019. Association for Computational Linguistics.
- [19] Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online, July 2020. Association for Computational Linguistics.
- [20] Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, and Karin Verspoor, editors, Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy, August 2019. Association for Computational Linguistics.
- [21] François Mairesse, Milica Gašić, Filip Jurčíček, Simon Keizer, Blaise Thomson, Kai Yu, and Steve Young. Phrase-based statistical language generation using graphical models and active learning. In Jan Hajič, Sandra Carberry, Stephen Clark, and Joakim Nivre, editors, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1552–1561, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
- [22] Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
- [23] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
- [24] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified multi-dimensional evaluator for text generation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
- [25] Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. AlignScore: Evaluating factual consistency with a unified alignment function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [26] Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
- [27] Edward J. Williams. The comparison of regression variables. Journal of the Royal Statistical Society: Series B (Methodological), 21(2):396–399, 1959.
- [28] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
## Appendix A Theoretical Analysis

This section provides formal analysis of the structural advantages of masked reconstruction over autoregressive scoring, complementing the empirical evidence in the main text.

### A.1 Structural Relationship to Autoregressive Scoring

We formalize the relationship between DiffScore and autoregressive (AR) scoring. The AR log-probability of a sequence $\mathbf{x}=(x^{1},\ldots,x^{L})$ follows a fixed left-to-right factorization:

$$\log p_{\mathrm{AR}}(\mathbf{x})=\sum_{n=1}^{L}\log p_{\mathrm{AR}}(x^{n}\mid x^{<n}). \tag{13}$$

This can be viewed as a *degenerate case* of the ELBO framework used by DiffScore. Specifically, consider a deterministic masking schedule where, at "timestep" $n$, all tokens from position $n$ to $L$ are masked and all tokens before position $n$ are unmasked. Under this schedule, evaluating token $x^{n}$ utilizes exactly the context $x^{<n}$, reproducing the AR factorization. The MDLLM-based ELBO generalizes this by marginalizing over *all random masking patterns*:

$$\mathrm{ELBO}(\mathbf{x}_{0};\theta)=\mathbb{E}_{t\sim\mathcal{U}(0,1)}\,\mathbb{E}_{\mathbf{x}_{t}\sim q(\mathbf{x}_{t}\mid\mathbf{x}_{0})}\left[\frac{1}{t}\sum_{i\in\mathcal{M}_{t}}\log p_{\theta}(x_{0}^{i}\mid\mathbf{x}_{t})\right]. \tag{14}$$

The key distinction is that AR scoring evaluates each token under a *single, fixed* context configuration, whereas DiffScore evaluates each token under an *expectation over exponentially many* context configurations. This marginalization yields a more robust quality estimate that is invariant to the specific factorization order.
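A tiny sketch of the degenerate schedule, using the `forward_mask`-style conventions from §2.2 (`MASK_ID` remains a hypothetical token id):

```python
def prefix_mask_schedule(x0):
    """Deterministic schedule of A.1: at 'timestep' n, positions n..L are
    masked and positions < n are visible, reproducing left-to-right AR
    conditioning. Yields the corrupted input for each n."""
    L = x0.shape[-1]
    for n in range(L):
        xt = x0.clone()
        xt[..., n:] = MASK_ID   # mask the suffix; the prefix stays visible
        yield n, xt

# Scoring position n under this schedule conditions on exactly x^{<n};
# DiffScore instead averages over random masking patterns at every rate t.
```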
### A.2 Training Objective Alignment

The effectiveness of model-based evaluation metrics depends critically on how well the model's training objective aligns with the evaluation task. We analyze this alignment for both AR and MDLLM-based metrics.

#### Autoregressive models.

AR models are trained to maximize $\sum_{n}\log p(x^{n}\mid x^{<n})$, minimizing the per-token cross-entropy in a left-to-right order. While this yields a valid estimate of sequence probability, the objective is inherently *local*: each prediction conditions only on preceding tokens. Text evaluation, however, requires *global* quality assessment: a high-quality text must be coherent, faithful, and fluent as a whole, not merely locally plausible at each position.

#### Masked diffusion models.

MDLLMs are trained to maximize $\mathbb{E}_{t,\mathbf{x}_{t}}\left[\sum_{i\in\mathcal{M}_{t}}\log p_{\theta}(x_{0}^{i}\mid\mathbf{x}_{t})\right]$ across uniformly sampled masking rates. Each masked token is predicted using full bidirectional context from all unmasked tokens. This objective is *natively aligned* with text evaluation: the model must understand the text holistically to reconstruct any subset of tokens from the remaining context. The masking rate $t$ continuously interpolates between local and global understanding.
#### Fine-tuning alignment.

When fine-tuning on task-specific corpora (e.g., CNN/DailyMail for summarization), both paradigms learn domain-specific patterns. However, the alignment advantage persists:

- BARTScore (BART-large-CNN): learns "given a document, a high-quality summary is more likely to be *generated* left-to-right."
- DiffScore-FT: learns "given a document, a high-quality summary is more likely to be *reconstructed* from any partial observation."

The latter is a strictly stronger condition, as it requires the text to be recoverable under all possible context ablations, not just the left-to-right order.
### A.3 Formal Analysis of Positional Bias

We define positional bias formally. For a scoring function $f$ applied to a sequence $\mathbf{x}$ of length $L$, let $s(n)$ denote the expected per-token score contribution at position $n$:

$$s_{\mathrm{AR}}(n)=\log p_{\mathrm{AR}}(x^{n}\mid x^{<n}),\qquad s_{\mathrm{Diff}}(n)=\mathbb{E}_{t,\mathbf{x}_{t}}\left[\log p_{\theta}(x^{n}\mid\mathbf{x}_{t})\cdot\mathbb{1}[x_{t}^{n}=\texttt{[M]}]\right]. \tag{15}$$

Positional bias metric. We quantify positional bias via the coefficient of variation (CoV) of per-position score distributions:

$$\mathrm{PosBias}(f)=\frac{\mathrm{Std}(\{s(n)\}_{n=1}^{L})}{\left|\mathrm{Mean}(\{s(n)\}_{n=1}^{L})\right|}. \tag{16}$$

For AR models, $s_{\mathrm{AR}}(1)$ conditions on no left context (only the source, if present), while $s_{\mathrm{AR}}(L)$ conditions on all preceding tokens. This creates a systematic monotonic trend in the expected context quality, introducing positional bias. In contrast, $s_{\mathrm{Diff}}(n)$ conditions on random subsets of tokens from *both directions*, and the expectation over masking patterns ensures that each position receives, on average, comparable amounts of contextual information, regardless of its location in the sequence.

Our empirical measurements (Appendix [L](https://arxiv.org/html/2605.11601#A12)) confirm this: DiffScore achieves a CoV of 0.174 versus 0.181 for BARTScore, with a $2.4\times$ reduction in mean positional standard deviation (2.31 vs. 5.61).
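Eq. (16) is straightforward to compute from per-position score arrays; the toy numbers below are purely illustrative, not measurements from the paper.

```python
import numpy as np

def positional_bias(per_position_scores):
    """Eq. (16): coefficient of variation of per-position score contributions s(n)."""
    s = np.asarray(per_position_scores, dtype=float)
    return s.std() / abs(s.mean())

# Toy illustration: an AR-like monotone trend in s(n) yields a larger CoV
# than a flat bidirectional profile (hypothetical values).
print(positional_bias([-6.0, -4.0, -3.0, -2.5, -2.0]))  # larger CoV
print(positional_bias([-3.0, -3.1, -2.9, -3.0, -3.0]))  # smaller CoV
```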
## Appendix B Dataset Details

Table [6](https://arxiv.org/html/2605.11601#A2.T6) provides a comprehensive summary of the 10 evaluation benchmarks used in this work. All datasets are publicly available and have been previously used for meta-evaluation of NLG metrics.

Table 6: Summary of evaluation benchmarks.

| Task | Dataset | Dimensions | #Samples | Level | Metric |
| --- | --- | --- | --- | --- | --- |
| SUM | SummEval | COH, FAC, FLU, INFO | 1,600 | Segment | Spearman $\rho$ |
| SUM | Newsroom | COH, FLU, INFO, REL | 60 | Segment | Spearman $\rho$ |
| SUM | REALSumm | COV | 100 | System | Pearson $r$ |
| SUM | Rank19 | FAC | 373 | Pairwise | Accuracy |
| SUM | QAGS-CNN | FAC | 235 | Segment | Pearson $r$ |
| SUM | QAGS-XSUM | FAC | 239 | Segment | Pearson $r$ |
| MT | WMT19 | Adequacy | 7 lang. pairs | Segment | Kendall $\tau$ |
| D2T | BAGEL | INF, NAT, QUA | 404 | System | Spearman $\rho$ |
| D2T | SFHOT | INF, NAT, QUA | 875 | System | Spearman $\rho$ |
| D2T | SFRES | INF, NAT, QUA | 1,181 | System | Spearman $\rho$ |

Summarization. SummEval [[15]](https://arxiv.org/html/2605.11601#bib.bib15) provides expert annotations of 1,600 summaries across four dimensions: coherence, consistency (factuality), fluency, and relevance. Newsroom [[16]](https://arxiv.org/html/2605.11601#bib.bib16) contains 60 summaries with four-dimensional human annotations. REALSumm [[17]](https://arxiv.org/html/2605.11601#bib.bib17) evaluates system-level coverage using LitePyramid recall over 25 systems. Rank19 [[18]](https://arxiv.org/html/2605.11601#bib.bib18) provides pairwise factuality comparisons for 373 summary pairs. QAGS-CNN and QAGS-XSUM [[19]](https://arxiv.org/html/2605.11601#bib.bib19) measure factual consistency via question-answering overlap.
Machine Translation. We use 7 language pairs from the WMT19 Metrics Shared Task [[20]](https://arxiv.org/html/2605.11601#bib.bib20): de-en, fi-en, gu-en, kk-en, lt-en, ru-en, and zh-en. Each pair contains segment-level direct assessment scores.

Data-to-Text. BAGEL [[21]](https://arxiv.org/html/2605.11601#bib.bib21) and SFHOT/SFRES [[22]](https://arxiv.org/html/2605.11601#bib.bib22) provide system-level annotations for informativeness, naturalness, and overall quality of texts generated from structured meaning representations.
## Appendix C Baseline Method Details

We provide extended descriptions of all baseline methods to clarify their mechanisms and the comparisons drawn in the main text.

#### Lexical Overlap Metrics.

BLEU [[3]](https://arxiv.org/html/2605.11601#bib.bib3) computes modified $n$-gram precision with a brevity penalty, originally designed for MT evaluation. We use SacreBLEU with default settings. ROUGE [[4]](https://arxiv.org/html/2605.11601#bib.bib4) measures recall-oriented $n$-gram overlap; we report ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). These metrics serve as lower-bound baselines, as they rely purely on surface-level matching without semantic understanding. METEOR [[23]](https://arxiv.org/html/2605.11601#bib.bib23) extends BLEU with stemming, synonymy matching, and a fragmentation penalty, providing somewhat better correlation with human judgments.
#### Embedding-based Metrics.

BERTScore [[5]](https://arxiv.org/html/2605.11601#bib.bib5) computes soft token-level alignments between candidate and reference using contextual embeddings from a pre-trained BERT model, reporting precision, recall, and F1. We use the recommended roberta-large checkpoint with IDF weighting. MoverScore [[6]](https://arxiv.org/html/2605.11601#bib.bib6) extends BERTScore by solving an Earth Mover's Distance problem over contextualized embeddings, measuring the minimum cost to transform the candidate embedding distribution into the reference distribution.

#### Autoregressive Probability Metrics.

BARTScore [[7]](https://arxiv.org/html/2605.11601#bib.bib7) is our primary comparison target. It scores text via the conditional log-likelihood under a fine-tuned BART-large-CNN model (406M parameters). We use the four directional variants (marginal, conditional, reverse, bidirectional) following the original paper's dimension-specific protocol. BARTScore's competitive performance stems from BART's pre-training (denoising autoencoder) and task-specific fine-tuning on CNN/DailyMail. GPTScore [[8]](https://arxiv.org/html/2605.11601#bib.bib8) uses GPT-2-large (774M parameters) as a zero-shot evaluator via conditional log-likelihood. It serves as the zero-shot AR baseline for comparison with DiffScore-Zero.
#### Supervised Multi\-dimensional Evaluators\.
UniEval\[[24](https://arxiv.org/html/2605.11601#bib.bib24)\]reformulates NLG evaluation as Boolean question answering, training a T5\-based model with dimension\-specific questions \(e\.g\., “Is this text coherent?”\)\. While achieving high correlation on trained dimensions, it requires extensive data curation for each evaluation aspect\.AlignScore\[[25](https://arxiv.org/html/2605.11601#bib.bib25)\]trains a unified alignment function on 4\.7M examples from 7 tasks \(NLI, QA, paraphrasing, etc\.\) to evaluate factual consistency\. It excels at factuality\-focused evaluations but has limited coverage of other quality dimensions\.QuestEval\[[26](https://arxiv.org/html/2605.11601#bib.bib26)\]generates questions from both source and candidate, then measures answer overlap to assess content preservation\.
#### LLM\-as\-a\-Judge\.
G-Eval [[9](https://arxiv.org/html/2605.11601#bib.bib9)] prompts GPT-4 with chain-of-thought instructions and a form-filling paradigm to produce quality scores. We use the official prompts and average over $n=20$ API calls per sample with probability-weighted scoring. While G-Eval achieves high correlations on some benchmarks, it relies on a proprietary API, is non-reproducible due to model updates, and incurs significant cost ($0.03–0.06 per sample at GPT-4 pricing).
## Appendix D Training and Implementation Details
### D.1 Model and Infrastructure
DiffScore is instantiated on two MDLLM backbones: LLaDA-8B [[12](https://arxiv.org/html/2605.11601#bib.bib12)] (primary) and Dream-7B [[13](https://arxiv.org/html/2605.11601#bib.bib13)] (generalization study). For zero-shot evaluation (DiffScore-Zero), we use the instruction-tuned variant (LLaDA-8B-Instruct). For domain-adapted evaluation (DiffScore-FT), we fine-tune the base model (LLaDA-8B-Base) using LoRA [[28](https://arxiv.org/html/2605.11601#bib.bib28)] with rank $r=16$, $\alpha=32$, and dropout $0.05$.
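For concreteness, a minimal sketch of this LoRA setup with the Hugging Face peft library is shown below. The `target_modules` choice is an assumption: the paper specifies only the rank, alpha, and dropout.

```python
# Sketch: LoRA adapter setup for LLaDA-8B-Base via Hugging Face peft.
# target_modules is an assumption; the paper specifies only r, alpha, dropout.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("GSAI-ML/LLaDA-8B-Base", trust_remote_code=True)
lora_cfg = LoraConfig(
    r=16,              # rank (Appendix D.1)
    lora_alpha=32,     # alpha = 32
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapters are trainable
```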
### D.2 Fine-Tuning Configuration
Table 7: Fine-tuning hyperparameters for DiffScore-FT.
### D.3 Data Formatting for Fine-Tuning
Training samples are formatted using a chat template to ensure compatibility with instruction\-tuned MDLLMs\. The format is as follows:
Table 8: SFT data formatting template. Only the assistant turn participates in the masked reconstruction loss.

This design ensures that during training, the model learns domain-specific reconstruction patterns for the candidate text (summary) while the source (document) remains fully visible as context, mirroring the evaluation-time setup. The `mask_prompt_loss=True` flag excludes all non-candidate tokens from the reconstruction loss, preventing the model from wasting capacity on reconstructing the source document.
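A minimal sketch of this prompt-loss masking step is shown below; the tensor names and the assumption that the prompt occupies a contiguous prefix are illustrative, since the exact chat template is given in Table 8.

```python
# Sketch: exclude non-candidate (prompt) tokens from the reconstruction loss,
# mirroring mask_prompt_loss=True. Names and the contiguous-prefix assumption
# are illustrative; the exact chat template is specified in Table 8.
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int, ignore_index: int = -100):
    """Labels equal input_ids on the assistant turn (the candidate summary)
    and ignore_index on the source document and template tokens."""
    labels = input_ids.clone()
    labels[:prompt_len] = ignore_index  # prompt tokens carry no loss
    return labels
```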
### D.4 Multi-Domain Extension
The same fine\-tuning framework extends to other NLG domains by substituting the training corpus:
- Machine Translation: WMT parallel corpora, with the source language as user input and the target language as assistant output. This yields DiffScore-FT-WMT.
- Data-to-Text: WebNLG or E2E datasets, with structured data as user input and the natural-language description as assistant output. This yields DiffScore-FT-D2T.
In our main experiments, DiffScore-FT is fine-tuned exclusively on CNN/DailyMail to maintain a strictly fair comparison with BARTScore (BART-large-CNN). Notably, the strong D2T results achieved by DiffScore-FT (Table [3](https://arxiv.org/html/2605.11601#S5.T3)) represent *out-of-domain* generalization, as the model was never trained on D2T data.
### D.5 Inference Configuration
For the standard DiffScore configuration, we sample $K=20$ masking patterns across $T=10$ uniformly spaced timesteps in $(0,1]$. For the lightweight DiffScore-Fast variant, we use $K=5$ and $T=5$. All masking patterns are drawn independently and uniformly at random. Since the $K$ forward passes are independent, they are fully parallelizable on modern hardware.
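The sampling loop can be sketched as follows. The `masked_logprob` model interface and the round-robin assignment of the $K$ patterns to the $T$ rates are assumptions; the paper does not specify how patterns are distributed across timesteps.

```python
# Sketch of the DiffScore Monte Carlo estimator: K masking patterns spread
# over T uniformly spaced masking rates in (0, 1]. masked_logprob(tokens, mask)
# is an assumed interface returning the summed log-probability of the masked
# tokens given the visible ones (and any conditioning source text).
import torch

def diffscore(tokens: torch.Tensor, masked_logprob, K: int = 20, T: int = 10) -> float:
    rates = torch.linspace(1.0 / T, 1.0, T)  # uniformly spaced timesteps in (0, 1]
    scores = []
    for k in range(K):
        t = rates[k % T].item()              # assumed: spread K patterns over T rates
        mask = torch.rand(len(tokens)) < t   # mask each token i.i.d. with prob. t
        if not mask.any():
            continue
        # mean log-probability (MLP) weighting: normalize by |M_t| (Appendix E)
        scores.append(masked_logprob(tokens, mask) / mask.sum().item())
    return sum(scores) / len(scores)
```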
## Appendix E Scoring Configuration Selection Rationale
The main text (§[3.1](https://arxiv.org/html/2605.11601#S3.SS1)) defines four scoring configurations. Here we provide the detailed rationale for matching configurations to evaluation dimensions, following and extending the protocol established by BARTScore [[7](https://arxiv.org/html/2605.11601#bib.bib7)].
#### Machine Translation\.
For MT adequacy, we use the conditional score $\textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c}\mid\mathbf{r})$, where $\mathbf{r}$ is the reference translation. The reference remains fully visible while the candidate is masked and reconstructed. The reconstruction probability directly measures how well the candidate preserves the semantic content of the reference: a high-quality translation will be easily reconstructable given the reference, as both express the same meaning. We do not use the marginal configuration here because MT adequacy is fundamentally a source-conditioned property.
#### Summarization\.
The multi-dimensional nature of summarization evaluation requires different configurations for different quality aspects (a schematic of all four configurations follows this list):
- Faithfulness / Consistency ($\textsc{DiffScore}_{\mathrm{cond}}$): The source document provides the factual grounding. A faithful summary should be easily reconstructable from the source, as all its content can be verified against the document. This directly probes whether the summary introduces hallucinated content.
- Coverage ($\textsc{DiffScore}_{\mathrm{rev}}$): By masking the source and keeping the candidate visible, we measure how much of the source can be recovered from the summary. A comprehensive summary enables reconstruction of the key source content.
- Fluency ($\textsc{DiffScore}_{\mathrm{mar}}$): Fluency is an intrinsic property of the text, independent of the source. The marginal configuration evaluates the candidate in isolation, measuring how well it conforms to the model’s learned linguistic priors.
- Coherence / Relevance / Informativeness ($\textsc{DiffScore}_{\mathrm{bi}}$): These dimensions require both source fidelity and intrinsic quality. The bidirectional score balances both directions, capturing the holistic quality of the summary.
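The schematic below makes the four configurations explicit, assuming the masked-reconstruction estimator sketched in Appendix D.5 wrapped as `score(visible, candidate)`; how the bidirectional score combines the two directions is an assumption on our part.

```python
# Schematic of the four scoring configurations (Appendix E). score(visible,
# candidate) masks and reconstructs `candidate` while `visible` stays fully
# observable. Names are illustrative; "bi" as a simple average is assumed.

def diffscore_cond(score, src, cand):
    return score(visible=src, candidate=cand)   # mask candidate; source visible

def diffscore_rev(score, src, cand):
    return score(visible=cand, candidate=src)   # mask source; candidate visible

def diffscore_mar(score, src, cand):
    return score(visible=None, candidate=cand)  # candidate in isolation

def diffscore_bi(score, src, cand):
    # balance both directions (assumed: arithmetic mean)
    return 0.5 * (diffscore_cond(score, src, cand) + diffscore_rev(score, src, cand))
```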
#### Data\-to\-Text\.
D2T generation requires both structural fidelity to the structured data records (informativeness) and linguistic naturalness. The bidirectional configuration $\textsc{DiffScore}_{\mathrm{bi}}$ jointly captures both aspects, as the structured source provides factual constraints while the candidate’s intrinsic quality reflects naturalness.
#### Weighting function selection\.
We default to the mean log-probability (MLP) weighting $\omega(t_k)=1/|\mathcal{M}_{t_k}|$ rather than the strict ELBO weighting $\omega(t_k)=1/t_k$ for practical reasons. At low masking rates (small $t$), the ELBO weighting amplifies the contribution of a few masked tokens, leading to high variance. The MLP weighting normalizes by the actual number of masked tokens, yielding more stable estimates while preserving the relative ordering of text quality.
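Concretely, the two estimators differ only in the per-timestep weight applied to the summed masked-token log-probabilities; the outer average over the $K$ samples is our assumption about how the samples are aggregated:

```latex
\mathrm{score}(\mathbf{c})
  = \frac{1}{K} \sum_{k=1}^{K} \omega(t_k)
    \sum_{n \in \mathcal{M}_{t_k}} \log p_\theta\!\bigl(x^{n} \mid x_{\setminus \mathcal{M}_{t_k}}, \mathbf{s}\bigr),
\qquad
\omega_{\mathrm{ELBO}}(t_k) = \frac{1}{t_k},
\quad
\omega_{\mathrm{MLP}}(t_k) = \frac{1}{\lvert \mathcal{M}_{t_k} \rvert}.
```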
## Appendix F Monte Carlo Convergence Analysis
Figure 4: Monte Carlo convergence of DiffScore as a function of the number of sampled masking patterns $K$. Correlation stabilizes at $K\geq 20$, with diminishing returns beyond $K=50$.

Since DiffScore estimates the ELBO via Monte Carlo sampling, a natural concern is the variance introduced by finite sampling. Figure [4](https://arxiv.org/html/2605.11601#A6.F4) shows how the average Spearman correlation with human judgments evolves as a function of $K$ on SummEval. The estimates converge rapidly: $K=20$ captures 96.1% of the $K=50$ correlation, and the standard deviation across 10 independent runs drops below 0.005 at $K=20$. This confirms that the default $K=20$ provides a reliable estimate with negligible variance overhead.
Table [9](https://arxiv.org/html/2605.11601#A6.T9) provides per-dimension convergence statistics, showing that all four quality dimensions stabilize at $K=20$.
Table 9: Per-dimension Spearman $\rho$ as a function of $K$ on SummEval, demonstrating convergence across all quality dimensions.
## Appendix G Sanity Check: Score Validity
We conduct four sanity checks to confirm that DiffScore produces meaningful evaluation signals.
(1) Correlation validity. On SummEval with the conditional configuration, DiffScore achieves Spearman correlations of 0.428 (COH), 0.481 (CON), 0.331 (FLU), and 0.436 (REL) with human judgments, all statistically significant at $p<10^{-8}$. The corresponding Kendall $\tau$ values are 0.312 (COH), 0.380 (CON), 0.256 (FLU), and 0.314 (REL), confirming consistent ranking quality across both rank correlation measures.
(2) Random source control. Replacing the correct source document with a randomly sampled document causes the conditional score to drop sharply (from $-0.812$ to $-3.087$, $\Delta=2.275$), confirming that DiffScore genuinely captures source–candidate alignment rather than producing degenerate scores independent of the source.
(3) Prompt template stability. As detailed in Appendix [I](https://arxiv.org/html/2605.11601#A9), scores remain stable across prompt variations (avg. $\sigma=0.010$).
(4) Monte Carlo convergence. Per-dimension correlations stabilize at $K=20$ across all four quality dimensions (Appendix [F](https://arxiv.org/html/2605.11601#A6), Table [9](https://arxiv.org/html/2605.11601#A6.T9)), with consistency and relevance showing continued mild improvement at $K=50$.
## Appendix H Computational Efficiency Analysis
Table 10: Efficiency–performance trade-off across DiffScore configurations. Per-sample latency measured on a single NVIDIA A100 GPU.

Figure 5: Pareto frontier of performance vs. computational cost. DiffScore-Fast ($K=5$) retains over 95% of standard performance at $3.7\times$ lower latency.

Table [10](https://arxiv.org/html/2605.11601#A8.T10) compares the latency–performance trade-off. While DiffScore incurs higher per-sample cost than BARTScore (0.88s vs. 0.01s) due to the 8B-parameter backbone and $K$-fold sampling, the lightweight DiffScore-Fast variant ($K=5$, $T=5$) retains 95.2% of the standard correlation at $3.7\times$ lower latency (0.24s per sample). In practice, the $K$ forward passes are fully parallelizable, and throughput can be further improved with batched inference. Compared to G-Eval, which requires multiple API calls to GPT-4, DiffScore offers a fully open-weight alternative with deterministic and reproducible evaluation.
#### Cost comparison with G\-Eval\.
At the time of writing, GPT-4 API pricing is approximately $0.03–0.06 per evaluation sample (with $n=20$ sampled responses). For SummEval (1,600 samples), G-Eval costs approximately $48–96 per full evaluation. DiffScore-Standard requires roughly 23 minutes of A100 compute for the same evaluation (1,600 × 0.88s), corresponding to approximately $1.15 at typical cloud GPU rates ($3/hour for an A100). This represents a 40–80× cost reduction while achieving comparable or better correlation.
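The arithmetic behind these figures, using the stated $3/hour rate (the result matches the quoted cost up to rounding):

```latex
1600 \times 0.88\,\mathrm{s} = 1408\,\mathrm{s} \approx 23.5\,\mathrm{min},
\qquad
\frac{1408\,\mathrm{s}}{3600\,\mathrm{s/h}} \times \$3/\mathrm{h} \approx \$1.17.
```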
## Appendix I Prompt Sensitivity Analysis
Table 11: Prompt sensitivity of zero-shot DiffScore on SummEval (Spearman $\rho$). Seven distinct prompt templates yield a standard deviation of only 0.010 across dimensions.

A potential concern for any LLM-based evaluator is sensitivity to prompt design. Table [11](https://arxiv.org/html/2605.11601#A9.T11) demonstrates that DiffScore is remarkably robust across seven distinct prompt formulations, with per-dimension standard deviations ranging from 0.010 (FLU) to 0.022 (COH). This robustness stems from the fact that DiffScore relies on reconstruction probability rather than instruction-following, making the exact prompt wording less critical than in generation-based evaluators such as G-Eval.
We provide the seven prompt templates used:
Table 12: Prompt templates used for sensitivity analysis.
## Appendix J Multi-Timestep Quality Profiles: Extended Analysis
### J.1 Full Timestep × Dimension Correlation Matrix
Table [13](https://arxiv.org/html/2605.11601#A10.T13) presents the complete correlation matrix underlying Figure [2](https://arxiv.org/html/2605.11601#S6.F2) in the main text. The clear pattern—fluency peaking at low-to-mid $t$, consistency at mid $t$, and coherence/relevance at high $t$—provides strong empirical support for the multi-granularity hypothesis.
Table 13: Spearman $\rho$ between single-timestep DiffScore and human judgments on SummEval. Bold: best timestep per dimension.
### J.2 Cross-Validated Learned Weights
Figure 6: Learned timestep weights via 5-fold cross-validation on SummEval. Different quality dimensions concentrate weight at different masking rates, confirming the multi-granularity hypothesis.

To validate that the multi-timestep structure provides a principled quality decomposition, we optimize timestep weights $\{w_k\}_{k=1}^{T}$ via 5-fold cross-validation on SummEval for each quality dimension independently. Figure [6](https://arxiv.org/html/2605.11601#A10.F6) shows the resulting weight distributions.
The learned weights exhibit a clear dimension-dependent pattern (a sketch of one plausible fitting procedure follows the list):
- Fluency: Weight concentrated at low masking rates ($t\leq 0.4$), with 62.5% of total weight in the range $t\in[0.1,0.4]$. This reflects that local syntactic well-formedness is best assessed when rich context is available.
- Coherence: 72.6% of weight at $t=0.9$, indicating that global discourse structure is best probed under sparse context that forces reliance on high-level structural cues.
- Relevance: 40.0% at $t=0.9$, with moderate weight at intermediate timesteps, consistent with relevance requiring both local and global assessment.
- Consistency: Distributed across mid-to-high timesteps ($t\in[0.5,0.8]$), with the largest single weight at $t=0.6$ (23.2%). This reflects that factual verification requires moderate context ablation to probe whether the summary’s claims are grounded in the source.
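The paper does not specify the optimizer or objective used to fit the weights, so the sketch below is one plausible implementation: nonnegative least squares against human scores, evaluated with 5-fold cross-validation via Spearman correlation.

```python
# Sketch: fit nonnegative timestep weights {w_k} by 5-fold CV (Appendix J.2).
# The NNLS objective against human scores is an assumption.
import numpy as np
from scipy.optimize import nnls
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def fit_weights(S: np.ndarray, y: np.ndarray, folds: int = 5):
    """S: (n_samples, T) per-timestep scores; y: human judgments for one dimension."""
    rhos, weights = [], []
    for train, test in KFold(folds, shuffle=True, random_state=0).split(S):
        w, _ = nnls(S[train], y[train])           # nonnegative weights over timesteps
        rhos.append(spearmanr(S[test] @ w, y[test]).correlation)
        weights.append(w / max(w.sum(), 1e-12))   # normalize for comparability
    return np.mean(weights, axis=0), float(np.mean(rhos))
```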
Table [14](https://arxiv.org/html/2605.11601#A10.T14) reports the quantitative improvement from learned weights over uniform weighting.
Table 14: Comparison of uniform vs. learned timestep weights (5-fold CV on SummEval). Learned weights yield consistent improvements for coherence and relevance, while fluency and consistency are relatively invariant to weighting.
### J.3 Quality Profile Curves
Figure 7: Quality profile curves for high-, median-, and low-quality summaries on SummEval. Higher-quality summaries maintain higher reconstruction scores across all masking rates, with the gap widening at high $t$.

Figure [7](https://arxiv.org/html/2605.11601#A10.F7) visualizes the quality profile curves for summaries of different quality tiers. High-quality summaries consistently achieve higher reconstruction scores across all timesteps. Notably, the quality gap widens at high masking rates ($t>0.7$), suggesting that global coherence is the most discriminative dimension for distinguishing summary quality. This provides intuitive support for the multi-timestep decomposition.
## Appendix K PMI Decomposition: Extended Analysis
### K.1 Adversarial Test Set Construction
The PMI decomposition experiment \(§[6\.2](https://arxiv.org/html/2605.11601#S6.SS2)\) uses two adversarial perturbation conditions constructed from SummEval\. We detail the construction procedure here\.
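For reference, the quantity being decomposed can be written as the gap between the conditional and marginal scores. This form is implied by the construction below (high marginal score but low conditional gain yields low PMI); the exact normalization follows the main text (§6.2).

```latex
\mathrm{PMI}(\mathbf{c};\,\mathbf{s})
  \;\approx\; \textsc{DiffScore}_{\mathrm{cond}}(\mathbf{c} \mid \mathbf{s})
  \;-\; \textsc{DiffScore}_{\mathrm{mar}}(\mathbf{c})
  \;\approx\; \log \frac{p(\mathbf{c} \mid \mathbf{s})}{p(\mathbf{c})}.
```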
#### Fluent\-irrelevant candidates\.
For each source document in SummEval, we replace the original candidate summary with a high-quality summary randomly sampled from a *different* SummEval source document. This produces candidates that are intrinsically fluent and well-formed (high marginal score) but topically unrelated to the source (low conditional gain, hence low PMI). Example:
> Source: (CNN) Donald Sterling’s racist remarks cost him an NBA team last year. But now it’s his former female companion who has lost big…
> Original summary: V. Stiviano must pay back $2.6 million in gifts from Donald Sterling…
> Fluent-irrelevant: Harry Kane has been in superb form for Tottenham this season. The 21-year-old has scored 30 goals in all competitions for Spurs…
#### Disfluent\-relevant candidates\.
For each source document, we apply controlled perturbations to the original summary to degrade fluency while preserving topical relevance. Perturbations include: (i) word-order swaps within clauses, (ii) article/preposition substitution, (iii) word repetition, and (iv) minor deletion. An example, with an illustrative implementation sketch after it:
> Original: V. Stiviano must pay back $2.6 million in gifts from Donald Sterling.
> Disfluent-relevant: V. must Stiviano pay back $2.6 million on gift from Donald Sterling.
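The sketch below implements the four perturbation types; the exact substitution rules, perturbation rates, and clause boundaries are not specified in the paper, so the details here are plausible stand-ins.

```python
# Illustrative perturbations for disfluent-relevant candidates (Appendix K.1).
# Exact rules and rates are not specified in the paper.
import random

ARTICLES = {"a": "the", "the": "a", "an": "the"}
PREPOSITIONS = {"in": "on", "on": "in", "from": "of", "of": "from"}

def perturb(tokens: list[str], rng: random.Random) -> list[str]:
    out = tokens[:]
    i = rng.randrange(max(len(out) - 1, 1))
    op = rng.choice(["swap", "substitute", "repeat", "delete"])
    if op == "swap" and len(out) > 1:          # (i) local word-order swap
        out[i], out[i + 1] = out[i + 1], out[i]
    elif op == "substitute":                   # (ii) article/preposition substitution
        w = out[i].lower()
        out[i] = ARTICLES.get(w, PREPOSITIONS.get(w, out[i]))
    elif op == "repeat":                       # (iii) word repetition
        out.insert(i, out[i])
    elif op == "delete" and len(out) > 2:      # (iv) minor deletion
        del out[i]
    return out
```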
### K.2 Detailed PMI Decomposition Results
Table [15](https://arxiv.org/html/2605.11601#A11.T15) extends the main-text Table [4](https://arxiv.org/html/2605.11601#S6.T4) with standard deviations and statistical test details, providing a more complete picture of the decomposition’s reliability.
Table 15: Extended PMI decomposition results with standard deviations. All pairwise differences significant at $p<10^{-5}$ (Mann–Whitney $U$) unless marked †.

Key observations. (1) For fluent-irrelevant candidates, the marginal score is statistically indistinguishable from the original ($p=0.32$ for DiffScore, $p=0.50$ for BARTScore), confirming that fluency is preserved. (2) The PMI collapse is more pronounced for DiffScore (from $+1.88$ to $+0.11$, a 94.1% reduction) compared to BARTScore (from $+2.61$ to $-0.26$), indicating cleaner separation. (3) DiffScore achieves tighter standard deviations across all conditions (average std 0.50 vs. 0.55 for BARTScore), enhancing the statistical reliability of the decomposition. (4) For disfluent-relevant candidates, PMI retention is high for both methods ($+1.77$ and $+2.27$), confirming that the relevance signal is robust to fluency degradation.
## Appendix L Positional Bias: Extended Analysis
Figure 8: Per-position token-level score distributions. DiffScore exhibits a mean positional std of 2.31 vs. 5.61 for BARTScore (a $2.4\times$ reduction).

Table [16](https://arxiv.org/html/2605.11601#A12.T16) presents the detailed positional bias statistics. The coefficient of variation (CoV) provides a normalized measure that accounts for differences in absolute score magnitude between the two methods.
Table 16: Positional bias statistics on SummEval. Lower values indicate more position-fair evaluation.

The $2.4\times$ reduction in mean positional standard deviation demonstrates that bidirectional masking produces substantially more position-fair evaluation. Qualitatively, BARTScore’s per-position distributions show a characteristic “warm-up” pattern where early tokens (positions 1–5) receive systematically lower scores due to impoverished left context. DiffScore’s distributions are more uniform across positions, as each token is evaluated under random bilateral context regardless of its sequential position.
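One reading of these statistics can be computed as below; treating `token_scores` as a position-aligned matrix is an illustrative simplification, since real summaries vary in length and would need padding or per-position pooling.

```python
# Sketch: per-position dispersion and coefficient of variation (Appendix L).
# token_scores[i, p] = score of position p in sample i (position-aligned for
# illustration; real summaries vary in length).
import numpy as np

def positional_bias_stats(token_scores: np.ndarray):
    per_pos_mean = token_scores.mean(axis=0)   # mean score at each position
    per_pos_std = token_scores.std(axis=0)     # dispersion across samples per position
    mean_positional_std = per_pos_std.mean()   # headline statistic (2.31 vs. 5.61)
    cov = per_pos_std / np.abs(per_pos_mean)   # normalizes for score magnitude
    return mean_positional_std, cov.mean()
```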
## Appendix M Directional Consistency: Extended Analysis
Figure 9: Directional consistency on 200 synthetic forward–reverse pairs.

### M.1 Test Set Construction
We construct 200 synthetic sequence pairs where the forward and reverse forms express identical factual content and should receive identical quality scores under an unbiased evaluator\. Examples follow the pattern:
> Forward: “Daphne Barrington authored ‘Shattered Light’”
> Reverse: “‘Shattered Light’ was authored by Daphne Barrington”
Pairs span diverse relations \(authorship, invention, discovery, founding\) with fictional entities to avoid memorization effects\.
### M.2 Detailed Results
Table 17: Directional consistency results on 200 synthetic reversal pairs.

DiffScore achieves a 76% improvement in rank correlation between forward and reverse scores (0.471 vs. 0.267). This demonstrates that marginalizing over random masking patterns produces an intrinsically symmetric evaluation substrate, whereas AR factorization inevitably introduces directional artifacts consistent with the Reversal Curse [[10](https://arxiv.org/html/2605.11601#bib.bib10)]. The mean consistency score (ratio of min/max score for each pair) is also higher for DiffScore (0.885 vs. 0.868) with lower variance (0.082 vs. 0.091), indicating more stable bidirectional evaluation.
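The two reported quantities can be computed as follows, given arrays of forward and reverse scores over the 200 pairs; whether raw scores or their magnitudes enter the min/max ratio is an assumption noted in the comments.

```python
# Sketch: directional consistency metrics (Appendix M.2). Assumes scores on a
# positive scale; with raw log-probabilities (negative), the min/max ratio
# would be taken over magnitudes instead.
import numpy as np
from scipy.stats import spearmanr

def directional_consistency(fwd: np.ndarray, rev: np.ndarray):
    rank_corr = spearmanr(fwd, rev).correlation          # 0.471 vs. 0.267 in Table 17
    ratio = np.minimum(fwd, rev) / np.maximum(fwd, rev)  # per-pair min/max ratio
    return rank_corr, ratio.mean(), ratio.std()
```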
## Appendix N Cross-Architecture Generalization: Dream Results
To verify that DiffScore’s advantages arise from the masked reconstruction paradigm rather than a specific model, we instantiate the framework on Dream-7B [[13](https://arxiv.org/html/2605.11601#bib.bib13)]. Table [18](https://arxiv.org/html/2605.11601#A14.T18) reports full results.
Table 18: Cross-architecture evaluation with Dream-7B. Zero-shot: Dream yields near-random correlations. Fine-tuned: Dream-FT achieves competitive results, confirming that the masked reconstruction paradigm itself—not a specific model—drives evaluation quality.

| Method | WMT19 ($\tau$) Avg. | WMT19 ($\tau$) Best pair | SummEval ($\rho$) Avg. | SummEval ($\rho$) Best dim. |
| --- | --- | --- | --- | --- |
| Dream-Zero | 0.051 | 0.206 (gu-en) | −0.144 | 0.077 (REL) |
| Dream-FT | 0.328 | 0.473 (zh-en) | 0.382 | 0.462 (CON) |
| BARTScore | 0.342 | 0.428 (zh-en) | 0.375 | 0.441 (COH) |
| DiffScore-FT (LLaDA) | 0.356 | 0.458 (zh-en) | 0.385 | 0.486 (CON) |

Key findings. (1) Dream-Zero yields near-random correlations across all tasks, indicating that a base MDLLM without sufficient pre-training scale and instruction tuning lacks the linguistic priors for zero-shot evaluation. (2) After fine-tuning, Dream-FT’s performance is fully unlocked and achieves results competitive with BARTScore, confirming that the masked reconstruction objective provides a universally effective inductive bias. (3) LLaDA-based DiffScore-FT consistently outperforms Dream-FT, demonstrating that stronger pre-trained representations translate directly to superior evaluation capabilities.
This cross-architecture validation is important for two reasons. First, it rules out the possibility that DiffScore’s improvements are due to LLaDA’s specific architecture or pre-training data rather than the masked reconstruction paradigm. Second, it suggests that as MDLLMs continue to improve, DiffScore will benefit directly from these advances.
## Appendix O Case Studies: Token-Level Analysis
We present detailed case studies from SummEval to illustrate DiffScore’s interpretability through token-level score analysis.
### O.1 High-Quality Summary
Table 19: Token-level analysis of a high-quality summary (human scores: COH=5.0, CON=5.0, FLU=5.0, REL=4.3).

Function words and high-frequency tokens receive high reconstruction scores, reflecting the model’s strong language modeling priors. Notably, proper nouns directly mentioned in the source (“Holland,” “Tampa”) score moderately well, indicating successful source–candidate alignment. Punctuation tokens receive lower scores, as their exact placement is less predictable from context.
### O.2 Low-Quality Summary
Table 20: Token-level analysis of a low-quality summary (human scores: COH=1.0, CON=4.7, FLU=4.3, REL=2.3).

The low-quality summary exhibits near-random discourse structure (multiple sentences without logical flow, a rhetorical question at the end). While individual tokens within each sentence achieve reasonable scores (high consistency with the source), the transition tokens and discourse markers receive very low scores, reflecting poor global coherence. The extreme penalty on the final period ($-11.03$) indicates that the model finds the abrupt ending after a question mark + period highly unlikely.
### O.3 Diagnostic Visualizations


Figure 10: Left: PMI decomposition visualization showing how conditional and marginal scores separate for different quality levels. Right: Token-level quality profile heatmap illustrating fine-grained quality patterns.

The diagnostic tools of DiffScore enable fine-grained analysis beyond scalar scores. The PMI decomposition (Figure [10](https://arxiv.org/html/2605.11601#A15.F10), left) visually demonstrates the separation of fluency and relevance components across quality tiers, while the quality profile heatmap (Figure [10](https://arxiv.org/html/2605.11601#A15.F10), right) reveals which tokens are most difficult to reconstruct at different masking rates, providing actionable diagnostic information for NLG system developers.
## Appendix P Masking Strategy Comparison
Figure 11: Comparison of masking strategies on SummEval. Uniform random masking substantially outperforms structured alternatives.

Beyond the ablation results in the main text (Table [5](https://arxiv.org/html/2605.11601#S6.T5)), Figure [11](https://arxiv.org/html/2605.11601#A16.F11) visualizes the per-dimension performance of different masking strategies. Table [21](https://arxiv.org/html/2605.11601#A16.T21) provides the detailed numerical comparison.
Table 21: Masking strategy comparison on SummEval (Spearman $\rho$). Uniform random masking outperforms structured alternatives across all dimensions.

Uniform random masking outperforms both content-word-only and entity-only masking across all dimensions. This confirms that comprehensive coverage of all token types—including function words, punctuation, and structural tokens—is essential for reliable quality assessment. Entity-only masking suffers the most, achieving only 51.5% of the random masking performance on average, as it neglects the syntactic and discourse-level signals that are critical for fluency and coherence evaluation. Content-word-only masking performs moderately on consistency (99.0% relative) but poorly on coherence (72.5%) and relevance (65.9%), suggesting that function words carry important signals about discourse structure.
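A sketch of how the three strategies select maskable positions is given below; the POS/NER heuristics via spaCy are assumptions, as the paper does not name a tagger.

```python
# Sketch: candidate position sets for the three masking strategies (Appendix P).
# The spaCy-based heuristics are assumptions; the paper does not name a tagger.
# Requires: python -m spacy download en_core_web_sm
import random
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

def maskable_positions(text: str, strategy: str) -> list[int]:
    doc = nlp(text)
    if strategy == "uniform":       # all tokens, incl. function words & punctuation
        return list(range(len(doc)))
    if strategy == "content_word":  # content words only
        return [i for i, t in enumerate(doc) if t.pos_ in CONTENT_POS]
    if strategy == "entity_only":   # tokens inside named entities only
        return [i for i, t in enumerate(doc) if t.ent_type_]
    raise ValueError(strategy)

def sample_mask(positions: list[int], rate: float, rng: random.Random) -> set[int]:
    return {i for i in positions if rng.random() < rate}
```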
## Appendix Q Sensitivity to Timestep Discretization
Figure 12: Full timestep × dimension heatmap on SummEval. Each cell shows Spearman $\rho$ between the single-timestep DiffScore and human judgments.

Figure [12](https://arxiv.org/html/2605.11601#A17.F12) provides the complete timestep × dimension correlation heatmap. The clear diagonal pattern—fluency peaking at low $t$, consistency at mid $t$, coherence and relevance at high $t$—provides strong empirical support for the multi-granularity hypothesis that underlies the quality profile mechanism.
Notably, $t=1.0$ yields zero correlation for all dimensions, as fully masking all tokens eliminates any informative context and reduces prediction to the model’s unconditional prior. This confirms that the evaluation signal arises from the interplay between masked and unmasked tokens, not from the model’s prior alone.
## Appendix R Ablation Study: Extended Results
We provide extended ablation results beyond those in the main-text Table [5](https://arxiv.org/html/2605.11601#S6.T5), including per-dimension breakdowns for all configurations.
### R.1 Scoring Mode Comparison: ELBO vs. Mean Log-Probability
Table [22](https://arxiv.org/html/2605.11601#A18.T22) compares the strict ELBO weighting ($\omega(t_k)=1/t_k$) with the mean log-probability (MLP) weighting ($\omega(t_k)=1/|\mathcal{M}_{t_k}|$).
Table 22: Scoring mode comparison on SummEval (Spearman $\rho$). MLP yields more stable estimates.

The MLP mode consistently outperforms ELBO weighting, with particularly large gains on coherence (+.065) and relevance (+.070). This advantage arises because the $1/t$ factor in the ELBO amplifies high-variance estimates from low masking rates, where only a few tokens are masked. MLP normalizes by the actual number of masked positions, producing more stable per-token scores.
### R.2 Timestep Discretization
The ablation over timestep counts $T$ (Table [5](https://arxiv.org/html/2605.11601#S6.T5) in the main text) shows that $T=10$ achieves the best average performance. Finer discretization ($T=20$) introduces estimation noise because each timestep bin contains fewer samples, increasing the variance of per-timestep estimates. Coarser discretization ($T=5$) misses the optimal granularity for some dimensions (notably coherence, where $t=0.9$ is the most informative single timestep).
## Appendix S Limitations and Future Work
#### Computational overhead\.
The primary limitation of DiffScore is computational cost. The standard configuration ($K=20$) requires 20 forward passes per sample, yielding roughly 88× higher latency than BARTScore. While the DiffScore-Fast variant ($K=5$) reduces this to roughly 24×, it remains substantially more expensive than single-pass metrics. Future work could explore amortized inference techniques or distillation to produce a single-pass approximation of DiffScore.
#### Model scale\.
Our primary backbone (LLaDA-8B) is 20× larger than BART-large (406M). While we demonstrate that the paradigm advantage holds across architectures (Dream-7B, Appendix [N](https://arxiv.org/html/2605.11601#A14)), we have not yet evaluated smaller MDLLMs (<1B parameters) to determine the minimum scale required for competitive evaluation. Currently, no such smaller MDLLMs are publicly available; as the ecosystem matures, smaller models may become viable and would directly benefit DiffScore.
#### Language coverage\.
All experiments are conducted on English-language benchmarks. While MDLLMs like LLaDA are trained on multilingual data, we have not validated DiffScore on non-English evaluation benchmarks. Extending to multilingual evaluation is a natural direction for future work.
#### Quality dimension coverage\.
Our evaluation focuses on standard NLG quality dimensions (fluency, coherence, faithfulness, relevance, coverage). Emerging evaluation desiderata such as safety, toxicity, and factual grounding in external knowledge bases are not directly addressed. The flexibility of the scoring configurations (Appendix [E](https://arxiv.org/html/2605.11601#A5)) suggests potential applicability, but empirical validation is needed.
#### Interaction with generation\.
An intriguing direction is using DiffScore as a reward signal for MDLLM-based text generation. Since the evaluation metric and the generator share the same architecture, this could enable more efficient and aligned training of NLG systems.
## Appendix T Broader Impact
DiffScore is an evaluation framework for natural language generation. As with any automated evaluation metric, there are considerations regarding its societal impact.
Positive impacts. (1) Providing a more accurate and interpretable evaluation metric can improve the development of NLG systems, leading to higher-quality generated text. (2) The open-weight nature of DiffScore (based on publicly available MDLLMs) offers a transparent and reproducible alternative to proprietary LLM-based evaluators such as G-Eval. (3) The PMI decomposition and quality profiles enable more nuanced quality assessment, potentially helping identify specific failure modes in NLG systems.
Potential risks. (1) Over-reliance on any automated metric may lead to optimization for metric scores rather than genuine quality improvements (Goodhart’s Law). (2) The evaluation capabilities could potentially be used to generate adversarial text that scores high on the metric while being low quality by other measures. (3) As with any model-based metric, DiffScore inherits potential biases from the underlying MDLLM’s pre-training data.
We encourage responsible use of DiffScore as one component of a comprehensive evaluation pipeline that includes human judgment.
## NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The paper’s contributions and scope are delineated in the Abstract and Section [1](https://arxiv.org/html/2605.11601#S1), with the first and last paragraphs of Section [1](https://arxiv.org/html/2605.11601#S1) specifying each respectively.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The limitations of the work are discussed in Appendix [S](https://arxiv.org/html/2605.11601#A19).
3. Theory assumptions and proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The paper provides detailed assumptions and proofs in Section [2](https://arxiv.org/html/2605.11601#S2) and Section [3](https://arxiv.org/html/2605.11601#S3).
4. Experimental result reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The paper provides sufficient reproducibility details in Section [4](https://arxiv.org/html/2605.11601#S4).
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We use publicly available datasets on GitHub and Hugging Face.
6. Experimental setting/details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
Answer: [Yes]
Justification: Detailed experimental configurations are provided in Section [4](https://arxiv.org/html/2605.11601#S4), with full implementation details in Appendix [D](https://arxiv.org/html/2605.11601#A4).
7. Experiment statistical significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report the statistics in Section [5](https://arxiv.org/html/2605.11601#S5).
8. Experiments compute resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Computational resource specifications are documented in Section [4](https://arxiv.org/html/2605.11601#S4) and Appendix [D](https://arxiv.org/html/2605.11601#A4).
9. Code of ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
Answer: [Yes]
Justification: Our research strictly adheres to the NeurIPS Code of Ethics.
10. Broader impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: Please refer to Appendix [T](https://arxiv.org/html/2605.11601#A20).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
Answer: [N/A]
Justification: The paper poses no such risks.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We properly cite all existing assets and respect their usage licenses.
13. New assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: Comprehensive documentation for newly introduced assets (e.g., code, data) is provided in the supplementary material.
14. Crowdsourcing and research with human subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: No human subjects were involved in our work.
16. Declaration of LLM usage
Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
Answer: [N/A]
Justification: Not applicable.