PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Summary
Introduces PoQ-Judge, a multi-architecture evaluation framework with reference-free judge models (TextCNN, MiniLM, DeBERTa) for cost-aware Proof-of-Quality in decentralized LLM inference, achieving high correlation with ground-truth proxies while eliminating the need for reference answers.
# A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Source: [https://arxiv.org/html/2606.11196](https://arxiv.org/html/2606.11196)
Arther Tian, Alex Ding\*, Frank Chen, Simon Wu, Aaron Chan \(all DGrid AI\)\. \*Corresponding author: alex\.ding@dgrid\.ai
###### Abstract
Decentralized large language model \(LLM\) inference networks require lightweight quality evaluation to drive consensus and reward allocation under Proof of Quality \(PoQ\)\. Prior work established cost\-aware PoQ, adversarial robustness via adaptive trust weighting, and multi\-dimensional quality scoring, but the strongest quality dimensions rely on reference\-based semantic similarity, a signal unavailable at inference time when ground\-truth answers do not exist\. This paper introduces *PoQ\-Judge*, a multi\-architecture evaluation framework that trains dedicated reference\-free judge models to score query–output pairs without access to reference answers\. We design three judge architectures spanning a quality–cost Pareto frontier: a TextCNN judge \($\sim$10M parameters, sub\-millisecond latency\), a MiniLM cross\-encoder judge \(22M parameters\), and a DeBERTa judge \(184M parameters\)\. All judges are trained via a two\-stage pipeline: pre\-training on the UltraFeedback corpus followed by fine\-tuning on GPT\-labeled in\-domain data covering question answering and summarization tasks\. On a held\-out test set \($n = 300$\), the DeBERTa judge achieves a Pearson correlation of 0\.747 with the ground\-truth proxy \(95% CI \[0\.663, 0\.816\]\), exceeding all reference\-based evaluators from our prior framework\. When integrated as a reference\-free dimension in composite scoring, the resulting signal attains 0\.645 Pearson correlation, matching the best single reference\-based evaluator without requiring reference answers\. We further study online dimension calibration via gradient\-based weight learning, which correctly identifies semantic quality as the dominant dimension \(learned weight $4.7\times$ initial\), and a cascade evaluation protocol that achieves 72\.7% cost savings with modest quality reduction\. 
Experiments reveal sharp task dependence—QA Pearson reaches 0\.830 while summarization drops to 0\.199—highlighting ground\-truth proxy limitations as the primary open challenge\.
## 1 Introduction
Decentralized LLM inference has emerged as a practical direction for scaling language model serving under constrained and heterogeneous compute\. Systems such as Petals demonstrate the feasibility of collaborative inference across distributed participants \[[5](https://arxiv.org/html/2606.11196#bib.bib5)\], while serving optimizations including paged attention and IO\-aware attention highlight that throughput and memory efficiency remain central bottlenecks even in centralized deployments \[[24](https://arxiv.org/html/2606.11196#bib.bib28),[10](https://arxiv.org/html/2606.11196#bib.bib29)\]\. In decentralized settings, a fundamental challenge is *verifying and pricing output quality*: participants contribute different models, hardware, and serving policies, and the network must assign rewards that reflect the usefulness of produced outputs without relying on heavyweight cryptographic proofs \[[32](https://arxiv.org/html/2606.11196#bib.bib31),[3](https://arxiv.org/html/2606.11196#bib.bib30)\]\.
Proof of Quality \(PoQ\) addresses this challenge by using lightweight evaluator models to score outputs and drive consensus\-based incentives\[[44](https://arxiv.org/html/2606.11196#bib.bib4)\]\. Our prior work developed this line in three stages\. First, cost\-aware PoQ introduced explicit latency\-based cost signals into reward computation, jointly optimizing for output quality and evaluation efficiency\[[39](https://arxiv.org/html/2606.11196#bib.bib1)\]\. Second, adaptive robust PoQ integrated Byzantine\-resilient aggregation and adaptive trust weighting to tolerate malicious or unreliable evaluators\[[41](https://arxiv.org/html/2606.11196#bib.bib2)\]\. Third, a multi\-dimensional quality scoring framework decomposed evaluation into interpretable dimensions—model priors, structural quality, semantic similarity, query–output alignment, and evaluator agreement—and showed that calibrated composites can match or exceed single\-evaluator baselines\[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\.
However, the multi\-dimensional framework exposed a critical deployment gap\. The strongest and most reliable dimension—semantic quality, based on sentence embedding similarity—requires access to a reference answer to compute its score\. In a live decentralized inference network, reference answers are generally unavailable: users submit queries and receive outputs, but no ground\-truth response exists for comparison\. Pre\-trained evaluators that do not require references, such as NLI\-based cross\-encoders, were found to correlate poorly or even negatively with ground\-truth quality, making them unsuitable as standalone quality signals\[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\. This creates a tension: the evaluation signal that PoQ most needs—a reliable, reference\-free quality score—is precisely the signal that off\-the\-shelf metrics fail to provide\.
Meanwhile, the LLM\-as\-a\-Judge paradigm has shown that language models can serve as effective evaluators when prompted or fine\-tuned for the task\[[46](https://arxiv.org/html/2606.11196#bib.bib10),[28](https://arxiv.org/html/2606.11196#bib.bib24)\]\. Systems such as Prometheus demonstrate that open\-source models can be specialized for evaluation with strong human correlation\[[20](https://arxiv.org/html/2606.11196#bib.bib21),[21](https://arxiv.org/html/2606.11196#bib.bib22)\]\. However, these approaches typically employ billion\-parameter models, incurring latency and cost that conflict with the efficiency requirements of PoQ\-style evaluation where thousands of outputs must be scored per consensus round\.
This paper introduces *PoQ\-Judge*, a multi\-architecture evaluation framework that bridges the reference\-free gap through dedicated, lightweight judge models trained specifically for decentralized inference quality assessment\. Our key insight is that the evaluation task in PoQ, scoring a \(query, output\) pair on a continuous quality scale, can be cast as a regression problem and solved by compact encoder models, without requiring the generative capacity of billion\-parameter judges\. We design three judge architectures that span a quality–cost Pareto frontier:
- TextCNN Judge \($\sim$10M parameters\): a convolutional architecture \[[22](https://arxiv.org/html/2606.11196#bib.bib38)\] offering sub\-millisecond inference, suitable for high\-throughput or cost\-constrained evaluation tiers\.
- MiniLM Judge \(22M parameters\): a cross\-encoder built on a distilled Transformer backbone \[[36](https://arxiv.org/html/2606.11196#bib.bib17)\], balancing quality and latency\.
- DeBERTa Judge \(184M parameters\): a disentangled\-attention encoder \[[15](https://arxiv.org/html/2606.11196#bib.bib19),[14](https://arxiv.org/html/2606.11196#bib.bib20)\] targeting the highest accuracy tier\.
All three judges are trained via a two\-stage pipeline: broad pre\-training on the UltraFeedback corpus \[[9](https://arxiv.org/html/2606.11196#bib.bib23)\] followed by targeted fine\-tuning on GPT\-labeled data from our PoQ task distribution\. The trained judges are then integrated as a new reference\-free dimension within the multi\-dimensional composite scoring framework of Tian et al\. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\.
\[Figure 1 diagram\. Input: user query $q$ \+ output $y_i$\. Training data: UltraFeedback \(45k samples\) for Stage 1 pre\-training; GPT\-labeled domain data \(1,400 train\) for Stage 2 fine\-tuning\. PoQ\-Judge models \(reference\-free\): TextCNN \(10M, 1 ms\), MiniLM \(22M, 13 ms\), DeBERTa \(184M, 15 ms\)\. Judge scores are combined with priors \+ structure and, if a reference is available, semantic \+ alignment into the composite quality score $\hat{s}(q, y_i)$, which feeds PoQ consensus \+ reward allocation\. Online calibration and the cascade protocol act on the composite\.\]

Figure 1: Overview of the PoQ\-Judge framework\. Three judge architectures are trained via a two\-stage pipeline \(top\) and deployed as reference\-free quality dimensions \(middle\)\. Judge scores are combined with structural priors and optional reference\-based dimensions into a composite quality signal, which feeds into PoQ consensus and reward allocation \(bottom\)\. Online calibration adjusts dimension weights during deployment, and a cascade protocol enables cost\-aware early stopping\.

Figure [1](https://arxiv.org/html/2606.11196#S1.F1) illustrates the complete framework\. Trained judges provide reference\-free quality scores that are integrated alongside structural priors and, when available, reference\-based semantic dimensions into a composite signal compatible with PoQ aggregation and incentive mechanisms\. Two additional deployment mechanisms \(online dimension calibration via gradient\-based weight learning, and a cascade evaluation protocol for cost\-aware early stopping\) further adapt the framework to the operational constraints of decentralized inference\.
Our main experimental findings are as follows\. On a held\-out test set of 300 samples spanning question answering and summarization tasks, the DeBERTa judge achieves Pearson correlation 0\.747 with the ground\-truth quality proxy \(95% bootstrap CI $[0.663, 0.816]$\), exceeding the best reference\-based evaluator from our prior framework \(sts\_paraphrase: 0\.629\)\. The reference\-free composite scoring mode, integrating judge scores with structural priors, attains Pearson 0\.645, matching the strongest single reference\-based evaluator without requiring reference answers\. Gradient\-based online calibration correctly identifies semantic quality as the dominant dimension, assigning it $4.7\times$ the initial weight while suppressing unreliable dimensions to near zero\. We also observe sharp task dependence: QA Pearson reaches 0\.830 while summarization Pearson drops to 0\.199, attributable primarily to limitations in the token\-level F1 ground\-truth proxy for summarization \[[27](https://arxiv.org/html/2606.11196#bib.bib15),[23](https://arxiv.org/html/2606.11196#bib.bib26)\]\. Finally, the TextCNN judge offers sub\-millisecond latency at Pearson 0\.472, establishing a viable low\-cost evaluation tier for high\-throughput deployments\.
#### Contributions\.
This paper makes the following contributions\.
- We introduce PoQ\-Judge, a multi\-architecture reference\-free evaluation framework for decentralized LLM inference, training three judge models \(TextCNN, MiniLM, DeBERTa\) via a two\-stage pipeline that transfers broad evaluation knowledge to the PoQ task distribution\.
- We provide a quality–cost Pareto analysis across judge architectures with bootstrap confidence intervals, showing that the DeBERTa judge \(0\.747 Pearson\) exceeds reference\-based baselines while the TextCNN judge \($<$1 ms\) enables cost\-sensitive evaluation tiers\.
- We demonstrate that reference\-free composite scoring \(Pearson 0\.645\) matches the best single reference\-based evaluator, closing the deployment gap identified in our prior multi\-dimensional framework \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\.
- We study online dimension calibration via EMA, bandit, and gradient strategies, showing that gradient\-based weight learning recovers interpretable dimension rankings consistent with offline reliability analysis\.
- We design a cascade evaluation protocol that achieves up to 72\.7% cost savings by routing confident samples through lightweight structural checks and reserving full evaluation for uncertain cases\.
#### Paper organization\.
Section [2](https://arxiv.org/html/2606.11196#S2) reviews PoQ and the reference\-free evaluation gap\. Section [3](https://arxiv.org/html/2606.11196#S3) presents the PoQ\-Judge framework, including judge architectures, training pipeline, composite integration, online calibration, and cascade evaluation\. Section [4](https://arxiv.org/html/2606.11196#S4) describes the experimental setup\. Section [5](https://arxiv.org/html/2606.11196#S5) reports results on judge quality, task dependence, composite scoring, calibration, and cascade tradeoffs\. Section [6](https://arxiv.org/html/2606.11196#S6) discusses findings and limitations\. Sections [7](https://arxiv.org/html/2606.11196#S7) and [8](https://arxiv.org/html/2606.11196#S8) cover related work and conclusions\.
## 2 Background and Problem Setting
This section summarizes the PoQ framework developed in our prior work and articulates the reference\-free evaluation gap that motivates the present study\. We keep the review concise; detailed formulations of cost\-aware rewards, robust aggregation, and multi\-dimensional scoring appear in Tian et al\. \[[39](https://arxiv.org/html/2606.11196#bib.bib1),[41](https://arxiv.org/html/2606.11196#bib.bib2),[40](https://arxiv.org/html/2606.11196#bib.bib3)\], respectively\.
### 2\.1 Proof of Quality for Decentralized Inference
#### System model\.
We consider a decentralized inference network comprising a set of inference nodes $\mathcal{I}$ that serve LLM outputs and a set of evaluator nodes $\mathcal{E}$ that score those outputs\. For a user query $q$, inference node $i$ produces a candidate output $y_i$\. Each evaluator $e$ computes a score $s_e(q, y_i) \in [0, 10]$ reflecting perceived quality\. Scores are aggregated into a consensus estimate $\hat{s}(q, y_i)$ that drives reward allocation $\pi(i)$ to inference nodes\. Cryptographic verification of inference correctness remains costly for large\-scale real\-time serving \[[32](https://arxiv.org/html/2606.11196#bib.bib31),[3](https://arxiv.org/html/2606.11196#bib.bib30)\], making evaluator\-based statistical verification the practical alternative \[[44](https://arxiv.org/html/2606.11196#bib.bib4)\]\.
#### Cost\-aware PoQ\.
Our first extension to PoQ introduced explicit cost awareness by incorporating latency\-based cost signals into the reward function \[[39](https://arxiv.org/html/2606.11196#bib.bib1)\]\. Let $c_i$ denote the normalized inference cost for node $i$ and $c_e$ the evaluation cost for evaluator $e$\. The reward function balances output quality against cost:

$$\pi(i) \;=\; f\!\left(\hat{s}(q, y_i),\; c_i\right) \tag{1}$$

where $f$ penalizes low quality and rewards cost efficiency, ensuring that cheaper nodes producing comparable quality receive appropriate incentives\. Evaluator nodes are similarly rewarded based on their closeness to consensus and their evaluation cost\.
#### Adaptive robust PoQ\.
In open\-participation networks, evaluators may be noisy, biased, or adversarial\. Our second extension addressed this through robust aggregation rules \(median, trimmed mean, and adaptive weighted consensus\) that reduce the influence of outlier scores \[[41](https://arxiv.org/html/2606.11196#bib.bib2)\]\. Adaptive trust weighting maintains per\-evaluator reliability estimates $w_e$ that are updated online based on deviation from consensus:

$$w_e^{(t+1)} \;=\; w_e^{(t)} \cdot g\!\left(\left|s_e - \hat{s}\right|\right) \tag{2}$$

where $g(\cdot)$ is a monotonically decreasing function that down\-weights evaluators with large deviations\. This mechanism draws on principles from Byzantine\-resilient aggregation and robust distributed learning \[[6](https://arxiv.org/html/2606.11196#bib.bib8),[4](https://arxiv.org/html/2606.11196#bib.bib6),[42](https://arxiv.org/html/2606.11196#bib.bib7),[12](https://arxiv.org/html/2606.11196#bib.bib9)\]\.
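As a rough illustration of Equation 2, the sketch below applies the multiplicative trust update for a few rounds. The paper does not specify $g$; the exponential form `exp(-d / tau)`, the scale `tau`, the per-round renormalization, and the names `update_trust` and `weighted_consensus` are all assumptions made for this example.

```python
import math


def update_trust(weights, scores, consensus, tau=2.0):
    """Apply w_e <- w_e * g(|s_e - s_hat|) once for every evaluator.

    g(d) = exp(-d / tau) is an assumed decreasing function and tau is a
    hypothetical scale parameter; neither comes from the paper.
    """
    updated = {
        e: w * math.exp(-abs(scores[e] - consensus) / tau)
        for e, w in weights.items()
    }
    total = sum(updated.values())
    # Renormalizing keeps weights on a fixed scale across rounds (assumption).
    return {e: w / total for e, w in updated.items()}


def weighted_consensus(weights, scores):
    """Trust-weighted mean of evaluator scores."""
    total = sum(weights[e] for e in scores)
    return sum(weights[e] * scores[e] for e in scores) / total


scores = {"e1": 7.0, "e2": 7.5, "e3": 1.0}  # e3 deviates strongly
weights = {e: 1.0 for e in scores}
for _ in range(3):
    s_hat = weighted_consensus(weights, scores)
    weights = update_trust(weights, scores, s_hat)
```

In this toy run the outlier `e3` loses trust every round, so the weighted consensus moves toward the two evaluators that agree with each other.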
#### Multi\-dimensional quality scoring\.
Our third extension moved beyond single\-evaluator scoring to a multi\-dimensional composite \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\. Quality is decomposed into $K$ interpretable dimensions, each producing a normalized score $z_k(q, y) \in [0, 10]$:

$$\hat{s}(q, y) \;=\; \sum_{k=1}^{K} \bar{w}_k\, z_k(q, y), \qquad \bar{w}_k = \frac{w_k}{\sum_j w_j} \tag{3}$$

Dimensions include model priors, structural quality heuristics, semantic similarity \(reference\-based\), query–output alignment \(NLI\-style\), and cross\-evaluator agreement\. A systematic reliability audit revealed that while semantic quality correlates strongly with the ground\-truth proxy \(Pearson 0\.733 on 2000 samples\), two other dimensions \(query–output alignment and agreement/uncertainty\) exhibit negative correlations overall and sharp task dependence, degrading the composite when included without calibration\.
Table 1:Summary of PoQ extensions in prior work and the gap addressed by this paper\.
### 2\.2 The Reference\-Free Evaluation Gap
The reliability audit in Tian et al\. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\] produced a clear ranking: semantic quality \(Pearson 0\.733 with GT\) dominates all other dimensions, while structural priors \(0\.466\), model priors, and other signals provide weaker complementary information\. However, semantic quality is computed as the embedding similarity between the model output $y$ and a reference answer $r$, typically using sentence encoders such as Sentence\-BERT or its variants \[[36](https://arxiv.org/html/2606.11196#bib.bib17),[13](https://arxiv.org/html/2606.11196#bib.bib18)\]\. This creates a fundamental deployment constraint: in a live inference network, reference answers $r$ are not available\.
Pre\-trained evaluators that operate without references were found to be unreliable in the PoQ setting\. Cross\-encoder models trained on natural language inference \(NLI\) tasks produced correlations that were near zero or negative: the CE\-DeBERTa cross\-encoder achieved $-0.363$ Pearson with GT on the test set, while CE\-MiniLM reached only 0\.331\. These models were trained for textual entailment rather than open\-ended quality assessment, and their scoring behavior does not align with the quality notion needed for PoQ rewards\.
The LLM\-as\-a\-Judge paradigm offers one potential solution\. Recent work has shown that large language models can serve as effective evaluators when appropriately prompted\[[46](https://arxiv.org/html/2606.11196#bib.bib10),[28](https://arxiv.org/html/2606.11196#bib.bib24)\], and that specialized evaluation models can achieve strong human correlation\[[20](https://arxiv.org/html/2606.11196#bib.bib21),[21](https://arxiv.org/html/2606.11196#bib.bib22)\]\. However, these systems typically require billion\-parameter generative models, which are too expensive for the scale of evaluation needed in PoQ—where each consensus round may require scoring multiple outputs from multiple evaluators within tight latency budgets\[[39](https://arxiv.org/html/2606.11196#bib.bib1)\]\. Moreover, LLM judges can exhibit position bias, verbosity bias, and self\-enhancement bias\[[7](https://arxiv.org/html/2606.11196#bib.bib25),[46](https://arxiv.org/html/2606.11196#bib.bib10)\], introducing systematic errors that compound across thousands of evaluation rounds\.
Holistic evaluation benchmarks such as HELM\[[26](https://arxiv.org/html/2606.11196#bib.bib12)\]and preference\-based platforms like Chatbot Arena\[[8](https://arxiv.org/html/2606.11196#bib.bib11)\]have advanced our understanding of LLM quality assessment, but these operate in offline, centralized settings and do not address the latency, cost, and trust requirements of decentralized deployment\.
### 2\.3 Problem Statement
We seek to train lightweight, reference\-free judge models that satisfy the following requirements:
1. Reference\-free\. The judge scores a $(q, y)$ pair without access to a ground\-truth answer $r$, producing $s_\theta(q, y) \in [0, 10]$, where $\theta$ denotes the model parameters\.
2. Aligned\. Judge scores should correlate with ground\-truth quality proxies at least as well as the strongest reference\-based evaluators in the existing framework\.
3. Efficient\. Inference latency should be compatible with PoQ evaluation budgets, spanning from sub\-millisecond \(structural tier\) to tens of milliseconds \(full evaluation tier\)\.
4. Composable\. Judge scores should integrate as a dimension within the multi\-dimensional composite scoring framework, compatible with online calibration and PoQ aggregation\.
The design space involves architecture selection \(trading capacity for latency\), training strategy \(data sources and transfer\), and integration protocol \(weighting, calibration, and cascade evaluation\)\. Table [1](https://arxiv.org/html/2606.11196#S2.T1) situates this problem relative to our prior PoQ extensions\.
## 3 PoQ\-Judge: Reference\-Free Evaluation Framework
This section presents the PoQ\-Judge framework in five parts: judge model architectures \(Section [3\.1](https://arxiv.org/html/2606.11196#S3.SS1)\), the two\-stage training pipeline \(Section [3\.2](https://arxiv.org/html/2606.11196#S3.SS2)\), integration into composite scoring \(Section [3\.3](https://arxiv.org/html/2606.11196#S3.SS3)\), online dimension calibration \(Section [3\.4](https://arxiv.org/html/2606.11196#S3.SS4)\), and the cascade evaluation protocol \(Section [3\.5](https://arxiv.org/html/2606.11196#S3.SS5)\)\.
### 3\.1 Judge Model Architectures
We design three judge architectures that target distinct operating points on the quality–cost Pareto frontier\. All three share a common interface: given a query–output pair $(q, y)$, the judge produces a scalar quality score $s_\theta(q, y) \in [0, 10]$ via a regression head, without access to any reference answer\.
#### TextCNN Judge \($\sim$10M parameters\)\.
The lightest architecture uses a convolutional neural network for text regression \[[22](https://arxiv.org/html/2606.11196#bib.bib38)\]\. The query and output are concatenated as a single token sequence, embedded via a learned word embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$, and processed by parallel 1D convolutional filters with kernel sizes $\{2, 3, 4, 5\}$ and 128 filters per kernel\. Max\-over\-time pooling extracts a fixed\-length representation, which is passed through a dropout layer and a linear regression head:

$$s_\theta(q, y) = \mathbf{w}^{\top}\, \text{MaxPool}\!\left(\bigoplus_{k \in \mathcal{K}} \text{Conv1D}_k(\mathbf{E}[q \oplus y])\right) + b \tag{4}$$

where $\bigoplus$ denotes concatenation over kernel sizes $\mathcal{K}$ and $\mathbf{w}, b$ are the regression parameters\. The TextCNN judge achieves sub\-millisecond inference latency on GPU \($\sim$1 ms on CPU\), making it suitable for the lowest\-cost evaluation tier in PoQ deployments\.
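To see how the layout in Equation 4 can land near the stated $\sim$10M parameters, the following back-of-the-envelope count may help. The kernel sizes and the 128 filters per kernel come from the text; the vocabulary size `V = 30522` and embedding width `d = 300` are hypothetical values chosen only for illustration, since the paper does not report them here.

```python
def textcnn_param_count(vocab_size, embed_dim, kernel_sizes=(2, 3, 4, 5),
                        filters_per_kernel=128):
    """Count trainable parameters of an embedding + Conv1D + linear-head TextCNN."""
    # Embedding matrix E in R^{V x d}.
    embedding = vocab_size * embed_dim
    # Each Conv1D_k has k * d * filters weights plus one bias per filter.
    conv = sum(k * embed_dim * filters_per_kernel + filters_per_kernel
               for k in kernel_sizes)
    # Linear regression head over the concatenated max-pooled features.
    head = len(kernel_sizes) * filters_per_kernel + 1
    return {"embedding": embedding, "conv": conv, "head": head,
            "total": embedding + conv + head}


# Assumed sizes: a 30,522-token vocabulary and 300-dimensional embeddings.
counts = textcnn_param_count(vocab_size=30_522, embed_dim=300)
print(counts)
```

Under these assumed sizes the embedding table accounts for well over 90% of the parameters, which suggests why such a model can be both roughly 10M parameters and very fast: the convolutional and regression layers that do the per-token compute are comparatively small.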
#### MiniLM Judge \(22M parameters\)\.
The mid\-tier architecture uses a cross\-encoder formulation built on a distilled Transformer backbone\[[36](https://arxiv.org/html/2606.11196#bib.bib17)\]\. The query and output are jointly encoded as a single input sequence with segment separation:
$$\mathbf{h} = \text{Encoder}\!\left([\texttt{CLS}]\; q\; [\texttt{SEP}]\; y\; [\texttt{SEP}]\right) \tag{5}$$

The $[\texttt{CLS}]$ representation is projected through a linear regression head to produce the quality score\. Cross\-encoder architectures allow full token\-level attention between query and output, capturing fine\-grained semantic interactions that bag\-of\-words or pooled representations miss\.
#### DeBERTa Judge \(184M parameters\)\.
The highest\-quality architecture uses DeBERTa\-v3\-base\[[15](https://arxiv.org/html/2606.11196#bib.bib19),[14](https://arxiv.org/html/2606.11196#bib.bib20)\], which introduces disentangled attention over content and position embeddings and uses ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing\. The input format follows the same cross\-encoder template as Equation[5](https://arxiv.org/html/2606.11196#S3.E5)\. DeBERTa’s disentangled attention mechanism has shown strong performance on natural language understanding benchmarks, which we hypothesize transfers well to quality assessment where nuanced understanding of query–output relationships is important\.
Table 2: Judge architecture summary\. All three models use a shared interface: input is a $(q, y)$ pair, output is a scalar score in $[0, 10]$\. Latency is measured on a single NVIDIA GPU with batch size 1\.

Table [2](https://arxiv.org/html/2606.11196#S3.T2) summarizes the three architectures\. The $15\times$ latency gap between TextCNN and DeBERTa motivates the cascade evaluation protocol in Section [3\.5](https://arxiv.org/html/2606.11196#S3.SS5), where cheap judges handle confident cases and expensive judges are reserved for ambiguous ones\.
### 3\.2 Two\-Stage Training Pipeline
Training a quality judge from scratch on domain\-specific data alone risks overfitting to the limited labeled set\. We address this through a two\-stage pipeline that first transfers broad evaluation knowledge from a large\-scale AI feedback corpus, then adapts to the target PoQ task distribution\.
#### Stage 1: Pre\-training on UltraFeedback\.
UltraFeedback is a large\-scale dataset of instruction–response pairs annotated with multi\-aspect quality scores by GPT\-4\[[9](https://arxiv.org/html/2606.11196#bib.bib23)\]\. We use a subset of 45,000 training and 5,000 validation samples, each consisting of a prompt, a model response, and an overall quality score\. Pre\-training exposes the judge to diverse instruction\-following patterns and quality variations, providing a warm start for the regression task before fine\-tuning on the narrower PoQ distribution\. The training objective is mean squared error \(MSE\) between predicted and labeled scores:
$$\mathcal{L}_{\text{pretrain}} = \frac{1}{N} \sum_{i=1}^{N} \left(s_\theta(q_i, y_i) - \hat{y}_i\right)^{2} \tag{6}$$

where $\hat{y}_i$ is the labeled quality score for sample $i$ and $N$ is the number of training samples\.
#### Stage 2: Fine\-tuning on GPT\-labeled domain data\.
The second stage fine\-tunes on data drawn from the PoQ task distribution\. We construct labeled data by running five heterogeneous inference models on question answering \[[33](https://arxiv.org/html/2606.11196#bib.bib36)\] and summarization \[[17](https://arxiv.org/html/2606.11196#bib.bib37)\] tasks, then scoring each output with GPT\-4o\-mini as a reference\-free judge\. This produces 1,400 training, 300 validation, and 300 test samples, with quality scores on a $[0, 10]$ scale\. Fine\-tuning uses the same MSE loss \(Equation [6](https://arxiv.org/html/2606.11196#S3.E6)\) with a reduced learning rate and early stopping based on validation Pearson correlation\.
\[Figure 2 diagram\. Stage 1 \(pre\-train, MSE loss, LR $10^{-3}$ / $2\times10^{-5}$\) uses UltraFeedback train \(45k\) and val \(5k\)\. Its weights transfer to Stage 2 \(fine\-tune, MSE loss, early stopping, LR $5\times10^{-4}$ / $5\times10^{-6}$\), which uses domain train \(1,400\) and domain val \(300\)\. Outputs: TextCNN, MiniLM, and DeBERTa judges\. The held\-out test set \(300\) is used for evaluation only\.\]

Figure 2: Two\-stage training pipeline for PoQ\-Judge models\. Stage 1 pre\-trains on the large\-scale UltraFeedback corpus to learn general evaluation patterns\. Stage 2 fine\-tunes on GPT\-labeled domain data from the PoQ task distribution\. The held\-out test set \(300 samples\) is used only for final evaluation\.

Figure [2](https://arxiv.org/html/2606.11196#S3.F2) illustrates the pipeline\. The two\-stage design is motivated by the data regime: the domain\-specific labeled set \(1,400 training samples\) is too small to train encoder models from a random initialization, but sufficient to adapt pre\-trained representations\. Learning rates are architecture\-dependent: TextCNN uses higher rates \($10^{-3}$ pre\-train, $5\times10^{-4}$ fine\-tune\) because its parameters are trained from scratch, while the encoder models use standard fine\-tuning rates \($2\times10^{-5}$ pre\-train, $5\times10^{-6}$ fine\-tune\) to avoid catastrophic forgetting\. All models use AdamW optimization with weight decay 0\.01 and patience\-based early stopping on validation Pearson correlation\.
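The patience-based early stopping on validation Pearson correlation described above can be sketched as follows. This is a minimal, framework-free illustration: the `EarlyStopper` class, the patience of 3, and the per-epoch correlation values are hypothetical and not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # correlation undefined for constant input; treat as no signal
    return cov / math.sqrt(vx * vy)


class EarlyStopper:
    """Stop when validation Pearson has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = -math.inf
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_pearson):
        """Record one epoch; return True when training should stop."""
        if val_pearson > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = val_pearson, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical per-epoch validation correlations, for illustration only.
history = [0.52, 0.61, 0.66, 0.65, 0.66, 0.64, 0.63]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, r in enumerate(history):
    if stopper.step(epoch, r):
        stopped_at = epoch
        break
```

In a real training loop, `val_pearson` would be computed with `pearson(predictions, labels)` on the validation split after each epoch, and the checkpoint from `stopper.best_epoch` would be the one kept.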
### 3\.3 Integration into Composite Scoring
Trained judges are integrated as a new *reference\-free* dimension into the multi\-dimensional composite scoring framework of Tian et al\. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\. We extend the original composite scorer with three operating modes that reflect different levels of reference availability\.
#### Scoring modes\.
Let $\mathcal{D}_{\text{ref}}$ denote dimensions requiring a reference answer \(semantic quality, query–output alignment, consensus agreement\) and $\mathcal{D}_{\text{free}}$ denote reference\-free dimensions \(model prior, cost\-efficiency prior, structural quality, judge score, query relevance\)\. The composite score under mode $m$ is:

$$\hat{s}_m(q, y) = \sum_{k \in \mathcal{D}_m} \bar{w}_k\, z_k(q, y), \qquad \mathcal{D}_m = \begin{cases} \mathcal{D}_{\text{ref}} \cup \mathcal{D}_{\text{free}} & m = \texttt{full} \\ \mathcal{D}_{\text{free}} & m = \texttt{ref\_free} \\ \text{auto-detect} & m = \texttt{auto} \end{cases} \tag{7}$$

where $\bar{w}_k$ are renormalized weights over the active dimension set\. The `auto` mode checks whether a reference answer is present in the record and selects `full` or `ref_free` accordingly\. This design allows the same composite scorer to operate in both offline analysis \(where references are available\) and live deployment \(where they are not\)\.
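A minimal sketch of the mode selection and weight renormalization in Equation 7 is given below. The dimension keys, the record layout (`"reference"` and `"scores"` fields), and the weight values are assumptions for illustration; only the grouping into reference-based and reference-free dimensions follows the text.

```python
# Dimension groups as described in the text; the key names are hypothetical.
REF_DIMS = {"semantic", "alignment", "agreement"}
FREE_DIMS = {"model_prior", "cost_prior", "structure", "judge", "relevance"}


def active_dims(mode, record):
    """Resolve the active dimension set D_m for a scoring mode."""
    if mode == "auto":
        # Auto-detect: use reference-based dimensions only if a reference exists.
        mode = "full" if record.get("reference") else "ref_free"
    if mode == "full":
        return REF_DIMS | FREE_DIMS
    if mode == "ref_free":
        return set(FREE_DIMS)
    raise ValueError(f"unknown mode: {mode}")


def composite(record, weights, mode="auto"):
    """Weighted composite with weights renormalized over the active set."""
    dims = [k for k in active_dims(mode, record) if k in record["scores"]]
    total = sum(weights[k] for k in dims)
    if total <= 0.0:
        raise ValueError("active dimensions have zero total weight")
    return sum(weights[k] / total * record["scores"][k] for k in dims)


# Illustrative weights, not values reported in the paper.
weights = {"semantic": 3.0, "alignment": 0.5, "agreement": 0.5,
           "model_prior": 1.0, "cost_prior": 1.0, "structure": 1.0,
           "judge": 2.0, "relevance": 1.0}

live = {"reference": None, "scores": {"judge": 7.0, "structure": 4.0}}
offline = {"reference": "gold answer",
           "scores": {"judge": 7.0, "structure": 4.0, "semantic": 9.0}}

live_score = composite(live, weights)                    # resolves to ref_free
offline_score = composite(offline, weights)              # resolves to full
forced_score = composite(offline, weights, "ref_free")   # ignores semantic
```

Because weights are renormalized over whichever dimensions are active, the live and offline scores stay on the same $[0, 10]$ scale even though they aggregate different dimension sets.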
#### Judge dimension\.
The trained judge contributes a dimension score $z_{\text{judge}}(q, y) = s_\theta(q, y)$ directly from the judge model output\. When multiple judge architectures are available, the cascade protocol \(Section [3\.5](https://arxiv.org/html/2606.11196#S3.SS5)\) selects which judge to invoke based on budget constraints, rather than ensembling their scores\.
### 3\.4 Online Dimension Calibration
Dimension reliability can vary across task distributions and evolve over time as the inference model pool changes\. Rather than relying solely on offline weight tuning, we introduce online dimension calibration that adjusts weights $\{w_k\}$ during PoQ simulation rounds\.
#### Calibration signal\.
At each round, the calibrator observes dimension scores $\{z_k\}$ and a quality signal for comparison\. Two signal sources are available: \(i\) the consensus score $\hat{s}$ \(always available but potentially noisy\), and \(ii\) an *anchor signal* from occasional reference\-based evaluation \(available at a configurable rate $\alpha$, e\.g\., $\alpha = 0.05$ means 5% of rounds include an anchor\)\. Anchor signals provide a higher\-fidelity calibration target at the cost of requiring reference answers for a small fraction of queries\.
#### Update strategies\.
We implement three online weight update strategies:
*\(a\) Exponential moving average \(EMA\)\.*For each dimensionkk, track the agreement betweenzkz\_\{k\}and the calibration signal via an exponential moving average\. Weights increase for dimensions with high agreement and saturate toward bound\[wmin,wmax\]\[w\_\{\\text\{min\}\},w\_\{\\text\{max\}\}\]:
ak\(t\)=\(1−η\)ak\(t−1\)\+η⋅𝟙\[\|zk\(t\)−s^\(t\)\|<ϵ\],wk\(t\)∝ak\(t\)a\_\{k\}^\{\(t\)\}=\(1\-\\eta\)\\,a\_\{k\}^\{\(t\-1\)\}\+\\eta\\cdot\\mathbb\{1\}\\\!\\left\[\\left\|z\_\{k\}^\{\(t\)\}\-\\hat\{s\}^\{\(t\)\}\\right\|<\\epsilon\\right\],\\qquad w\_\{k\}^\{\(t\)\}\\propto a\_\{k\}^\{\(t\)\}\(8\)
*(b) Bandit (UCB-style).* Treat each dimension as an arm in a multi-armed bandit problem \[[1](https://arxiv.org/html/2606.11196#bib.bib45)\]. The reward for dimension $k$ at round $t$ is its correlation with the anchor signal over a sliding window. An upper confidence bound balances exploitation of high-reward dimensions with exploration of uncertain ones.
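The bandit strategy is described only at a high level, so the following is a sketch under stated assumptions rather than the paper's implementation: the reward is a windowed Pearson correlation with the anchor, and the index is the standard UCB1 form (mean reward plus an exploration bonus). The class `WindowedCorrelationArm`, the window length, and the constant `c` are hypothetical, and how indices map to weights is left open.

```python
import math
from collections import deque

import numpy as np


class WindowedCorrelationArm:
    """One scoring dimension treated as a bandit arm (hypothetical helper)."""

    def __init__(self, window=50):
        self.pairs = deque(maxlen=window)  # recent (z_k, anchor) pairs
        self.n = 0                         # total anchor observations

    def observe(self, z_k, anchor):
        self.pairs.append((float(z_k), float(anchor)))
        self.n += 1

    def reward(self):
        """Pearson correlation with the anchor over the sliding window."""
        if len(self.pairs) < 3:
            return 0.0
        z, s = np.array(list(self.pairs), dtype=float).T
        if z.std() == 0 or s.std() == 0:
            return 0.0  # correlation undefined for constant series
        return float(np.corrcoef(z, s)[0, 1])


def ucb_index(arm, t, c=math.sqrt(2)):
    """UCB1-style index: windowed reward plus an exploration bonus."""
    if arm.n == 0:
        return math.inf  # unobserved arms are explored first
    return arm.reward() + c * math.sqrt(math.log(t) / arm.n)
```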
*(c) Gradient descent.* When anchor signals are available, directly minimize the squared prediction error of the weighted composite with respect to the weight vector $\mathbf{w}$ \[[38](https://arxiv.org/html/2606.11196#bib.bib46)\]:

$$\mathbf{w}^{(t+1)}=\text{Proj}_{[\mathbf{w}_{\text{min}},\,\mathbf{w}_{\text{max}}]}\!\left(\mathbf{w}^{(t)}-\eta\nabla_{\mathbf{w}}\left(\hat{s}_{\mathbf{w}}^{(t)}-s_{\text{anchor}}^{(t)}\right)^{2}\right)\tag{9}$$

where Proj clips weights to the feasible range. Gradient calibration is the most sample-efficient of the three strategies when anchor signals are available, and it produces interpretable weight vectors that reveal dimension importance.
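One projected gradient step of Eq. (9) might look like the sketch below. It assumes the composite uses the renormalized weights of Eq. (7), i.e. $\hat{s}_{\mathbf{w}}=\sum_{k}w_{k}z_{k}/\sum_{j}w_{j}$, which gives $\partial\hat{s}_{\mathbf{w}}/\partial w_{k}=(z_{k}-\hat{s}_{\mathbf{w}})/\sum_{j}w_{j}$. The step size and bounds are illustrative values, not the paper's settings.

```python
import numpy as np


def composite(w, z):
    """Composite score with weights renormalized to sum to one."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    return float(np.dot(w, z) / np.sum(w))


def gradient_step(w, z, s_anchor, eta=0.05, w_min=0.01, w_max=5.0):
    """One projected gradient step on (s_hat_w - s_anchor)^2 (Eq. 9)."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    s_hat = composite(w, z)
    # Chain rule: d/dw_k (s_hat - s)^2 = 2 (s_hat - s) * (z_k - s_hat) / sum(w)
    grad = 2.0 * (s_hat - s_anchor) * (z - s_hat) / np.sum(w)
    # Projection onto the box [w_min, w_max]; w_min > 0 keeps sum(w) positive
    return np.clip(w - eta * grad, w_min, w_max)
```

In this form, a dimension whose score sits on the same side of the composite as the anchor gains weight, and one on the opposite side loses weight.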
#### Dimension gating\.
Optionally, dimensions with sustained low agreement \(EMA below a threshold for a patience window\) can be dynamically*gated*—excluded from the composite until agreement recovers\. Protected dimensions \(e\.g\., structural quality\) are exempt from gating\.
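The gating rule can be sketched as a small state machine. The class name `DimensionGate`, the threshold, the patience length, and the dimension names are hypothetical; re-enabling a dimension as soon as its EMA agreement returns above the threshold is one possible reading of "until agreement recovers".

```python
class DimensionGate:
    """Gate dimensions whose EMA agreement stays low (illustrative sketch)."""

    def __init__(self, threshold=0.3, patience=20, protected=("structure_quality",)):
        self.threshold = threshold
        self.patience = patience
        self.protected = set(protected)  # never gated
        self.low_streak = {}             # consecutive low-agreement rounds
        self.gated = set()

    def update(self, ema_agreement):
        """Update gating state and return the set of active dimensions."""
        for k, a in ema_agreement.items():
            if k in self.protected:
                continue
            if a < self.threshold:
                self.low_streak[k] = self.low_streak.get(k, 0) + 1
                if self.low_streak[k] >= self.patience:
                    self.gated.add(k)
            else:
                # Agreement recovered: reset the streak and re-enable
                self.low_streak[k] = 0
                self.gated.discard(k)
        return set(ema_agreement) - self.gated
```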
### 3\.5Cascade Evaluation Protocol
Full multi\-dimensional evaluation on every output is unnecessary when many outputs are clearly high or low quality\. We introduce a cascade protocol that routes outputs through progressively more expensive evaluation layers, stopping early when confidence is sufficient\.
#### Layer design\.
The cascade consists of three layers, ordered by increasing cost:
1. Layer 1 (Structural): Zero-cost dimensions (model prior, cost-efficiency prior, structural quality). These scores are precomputed or require only lightweight heuristics.
2. Layer 2 (Lightweight judge): Reference-free neural scoring with the trained judge model (e.g., TextCNN or MiniLM) and query relevance. Adds moderate cost but no reference dependency.
3. Layer 3 (Full): All remaining dimensions, including reference-based semantic similarity, alignment, and consensus agreement. Highest cost, invoked only when necessary.
#### Confidence estimation\.
After each layer, a confidence estimator determines whether the accumulated evidence is sufficient to produce a reliable composite score. We combine two signals: (i) *extremity*, where scores far from the midpoint indicate clear quality, and (ii) *agreement*, the consistency among the dimensions scored so far. If the combined confidence exceeds a layer-specific threshold $\tau_{\ell}$, evaluation stops and the partial composite is returned.
#### Budget allocation\.
Each evaluation round operates under a cost budget $B$. The cascade allocator tracks cumulative cost across layers and respects the budget constraint. Low budgets force most evaluations to stop at Layer 1 (structural checks only), while higher budgets allow progression to Layers 2 and 3. This design directly supports the cost-aware philosophy of PoQ \[[39](https://arxiv.org/html/2606.11196#bib.bib1)\], where evaluation cost is an explicit factor in reward computation.
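The layer, confidence, and budget logic described in this section can be put together in a short sketch. The text does not give the exact confidence formula, so the equal-weight mix of extremity and agreement, the $[0,10]$ score scale, the unweighted mean used as the partial composite, and the names `confidence` and `run_cascade` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def confidence(scores, midpoint=5.0, half_range=5.0):
    """Combine extremity and agreement into a value in [0, 1] (assumed form)."""
    m = mean(scores)
    extremity = min(abs(m - midpoint) / half_range, 1.0)
    spread = pstdev(scores) if len(scores) > 1 else 0.0
    agreement = max(0.0, 1.0 - spread / half_range)
    return 0.5 * extremity + 0.5 * agreement


def run_cascade(layers, budget, thresholds):
    """Run layers in order until confident or out of budget.

    layers     : list of (cost, scorer); scorer() returns a list of scores
    thresholds : per-layer confidence thresholds tau_l
    Returns (partial composite, layers used, cost spent).
    """
    collected, spent, used = [], 0.0, 0
    for (cost, scorer), tau in zip(layers, thresholds):
        # Layer 1 always runs; later layers must fit in the remaining budget
        if used > 0 and spent + cost > budget:
            break
        collected.extend(scorer())
        spent += cost
        used += 1
        if confidence(collected) >= tau:
            break  # early stop: accumulated evidence is sufficient
    return mean(collected), used, spent
```

A clearly high-scoring output stops at the first layer regardless of budget, while an ambiguous one escalates only as far as the budget allows.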
## 4Experimental Setup
We evaluate the PoQ-Judge framework along two axes: (1) *judge quality*, measuring how well each judge architecture correlates with ground-truth quality proxies, and (2) *mechanism-level impact*, measuring how judge-based composite scoring and calibration affect PoQ simulation outcomes. Our experimental protocol reuses the decentralized inference testbed from Tian et al. \[[39](https://arxiv.org/html/2606.11196#bib.bib1),[41](https://arxiv.org/html/2606.11196#bib.bib2),[40](https://arxiv.org/html/2606.11196#bib.bib3)\] to ensure comparability across the paper series.
### 4\.1Tasks and Datasets
We evaluate on two representative task families that stress\-test different quality aspects:
- Question answering (QA): derived from SQuAD \[[33](https://arxiv.org/html/2606.11196#bib.bib36)\], where correctness is sensitive to factual accuracy and instruction compliance.
- Summarization: derived from CNN/DailyMail \[[17](https://arxiv.org/html/2606.11196#bib.bib37)\], where semantic coverage and faithfulness are central.
Each of the five inference models generates 200 outputs per task, yielding 2,000 evaluation records (1,000 QA + 1,000 summarization). For judge model training and evaluation, we construct a separate labeled dataset via the two-stage pipeline described in Section [3.2](https://arxiv.org/html/2606.11196#S3.SS2): 1,400 training, 300 validation, and 300 held-out test samples. The test set is balanced across tasks (150 QA + 150 summarization) and approximately balanced across models.
Table 3: Dataset overview. Five inference models generate outputs for QA and summarization tasks, yielding 2,000 evaluation records. Judge training uses a separate labeled split.

#### Ground-truth proxy.
Following our prior work, we use token-level F1 between the model output and the reference answer as the primary ground-truth (GT) proxy, normalized to a $[0,10]$ scale \[[39](https://arxiv.org/html/2606.11196#bib.bib1),[40](https://arxiv.org/html/2606.11196#bib.bib3)\]. The GT proxy is well suited for extractive QA, where correct answers are short and token overlap is meaningful, but is a weaker proxy for summarization, where semantic coverage matters more than lexical overlap \[[27](https://arxiv.org/html/2606.11196#bib.bib15),[23](https://arxiv.org/html/2606.11196#bib.bib26)\]. We additionally collect GPT-4o-mini reference-free scores as a secondary quality signal for training judge models. The GT score distribution across the full dataset has mean 3.15 (std 2.93) on the $[0,10]$ scale, reflecting substantial quality heterogeneity across inference models.
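For concreteness, a token-level F1 proxy scaled to $[0,10]$ can be computed as in the sketch below. The tokenization (lowercasing and whitespace splitting) is an assumption; the text does not state whether punctuation or articles are stripped, as the official SQuAD script does. The function names are illustrative.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model output and a reference answer."""
    # Assumption: lowercase + whitespace tokenization
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def gt_proxy(prediction: str, reference: str) -> float:
    """Scale token-level F1 to the [0, 10] range used for the GT proxy."""
    return 10.0 * token_f1(prediction, reference)
```

A long but faithful summary with little lexical overlap would score low under this measure, which illustrates why the proxy is weaker for summarization than for extractive QA.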
### 4\.2Inference Model Pool
The five inference models in Table [3](https://arxiv.org/html/2606.11196#S4.T3) are selected to capture the quality–cost heterogeneity typical of decentralized deployments. Parameter counts range from 1.1B (TinyLlama) to 3.8B (Phi-3-mini), and average per-sample latency ranges from 1.8s to 2.8s. Larger models (Llama-3.2-3B, Gemma-2-2B) generally produce higher GT scores, but the relationship between model size and quality is not monotonic: Phi-3-mini and Qwen2-1.5B occupy a high-cost, lower-quality region, illustrating the cost–quality mismatches that PoQ mechanisms must handle \[[39](https://arxiv.org/html/2606.11196#bib.bib1)\].
### 4\.3Evaluator Pool and Baselines
#### Reference\-based evaluators \(Paper 3 baselines\)\.
We retain the five pre-trained evaluators from Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\] as baselines: three sentence-embedding models operating in bi-encoder mode (STS-paraphrase, STS-stsb, STS-MiniLM) that compute cosine similarity between output and reference embeddings \[[36](https://arxiv.org/html/2606.11196#bib.bib17),[13](https://arxiv.org/html/2606.11196#bib.bib18)\], and two cross-encoder models (CE-MiniLM, CE-DeBERTa) trained on NLI tasks. All reference-based evaluators require access to a reference answer $r$.
#### Reference\-free evaluators \(this paper\)\.
Our three trained judges (TextCNN, MiniLM, DeBERTa) operate on $(q,y)$ pairs without reference answers. We also evaluate the pre-trained cross-encoders in a reference-free mode (using $(q,y)$ instead of $(r,y)$ as input), although these were not trained for this purpose and serve primarily as negative baselines.
Table 4: Evaluator summary. Reference-based evaluators require a ground-truth answer $r$; our trained judges operate on $(q,y)$ pairs only. Latency is per-pair on GPU.

| Evaluator | Type | Ref. Required | Latency | Source |
|---|---|---|---|---|
| STS-paraphrase | Bi-encoder (STS) | Yes | ~0.7 ms | Pre-trained |
| STS-stsb | Bi-encoder (STS) | Yes | ~0.8 ms | Pre-trained |
| STS-MiniLM | Bi-encoder (STS) | Yes | ~0.7 ms | Pre-trained |
| CE-MiniLM | Cross-encoder (NLI) | Yes | ~0.5 ms | Pre-trained |
| CE-DeBERTa | Cross-encoder (NLI) | Yes | ~7.3 ms | Pre-trained |
| TextCNN Judge | Trained CNN | No | ~1.0 ms | This paper |
| MiniLM Judge | Trained encoder | No | ~13 ms | This paper |
| DeBERTa Judge | Trained encoder | No | ~15 ms | This paper |
#### Composite scoring baselines\.
We compare three composite scoring configurations: (i) `full` mode using all dimensions including reference-based ones, (ii) `ref_free` mode using only reference-free dimensions plus the trained judge, and (iii) individual evaluator baselines. For composite modes, the DeBERTa judge is used as the default judge dimension unless otherwise noted.
### 4\.4Evaluation Protocol
#### Judge evaluation\.
All judge quality metrics are computed on the held-out test set ($n=300$), which is disjoint from both the pre-training and fine-tuning data. We report Pearson correlation, Spearman correlation, and mean absolute error (MAE) against the GT proxy. Bootstrap 95% confidence intervals are computed via 2,000 resamples to quantify uncertainty given the moderate test set size. Per-task breakdowns (QA vs. summarization, $n=150$ each) and per-model breakdowns (five inference models, $n=55$–$66$ each) provide granular diagnostic information.
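A bootstrap interval of this kind can be sketched as follows. The text specifies 2,000 resamples but not the bootstrap variant, so the percentile method used here is an assumption; the function name and the fixed seed are illustrative.

```python
import numpy as np


def bootstrap_pearson_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        # Resample (x, y) pairs jointly, with replacement
        idx = rng.integers(0, n, size=n)
        stats[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    r = float(np.corrcoef(x, y)[0, 1])
    return r, (float(lo), float(hi))
```

Calling it with judge scores and GT proxy values for the test records returns the point estimate together with the interval; subgroup intervals are wider simply because $n$ is smaller.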
#### Composite scoring evaluation\.
Composite scoring modes are evaluated on the same test set by computing Pearson correlation between the composite score and GT\. Per\-dimension GT correlations are reported to characterize dimension reliability and motivate calibration\.
#### Online calibration evaluation\.
Calibration experiments use the full 2,000-record dataset within a Monte Carlo PoQ simulation. We run $T=5{,}000$ rounds per configuration with $K=3$ evaluators sampled per job. Eight scenarios are compared: no calibration (baseline), four EMA variants (0%, 1%, 5%, 10% anchor), EMA with dimension gating, bandit with 5% anchor, and gradient with 5% anchor. The primary metric is average inference reward; we additionally report the final calibrated weight vector for the gradient strategy.
#### Cascade evaluation\.
Cascade experiments sweep six budget levels ($B\in\{0.05,0.10,0.20,0.30,0.50,1.00\}$) and report GT Pearson, cost savings relative to full evaluation, average number of layers traversed, and the stop distribution across layers.
#### PoQ simulation parameters\.
Cost normalization, reward functions, and trust weighting follow the configurations established in Tian et al. \[[39](https://arxiv.org/html/2606.11196#bib.bib1),[41](https://arxiv.org/html/2606.11196#bib.bib2)\]. Inference costs $\{c_{i}\}$ and evaluation costs $\{c_{e}\}$ are derived from measured latency profiles (Table [3](https://arxiv.org/html/2606.11196#S4.T3) and Table [4](https://arxiv.org/html/2606.11196#S4.T4)). The consensus method is an adaptive weighted mean with trust updates, as described in Section [2.1](https://arxiv.org/html/2606.11196#S2.SS1).
## 5Results
We present results in six parts: judge model quality (Section [5.1](https://arxiv.org/html/2606.11196#S5.SS1)), task dependence (Section [5.2](https://arxiv.org/html/2606.11196#S5.SS2)), reference-free vs. reference-based composite scoring (Section [5.3](https://arxiv.org/html/2606.11196#S5.SS3)), online dimension calibration (Section [5.4](https://arxiv.org/html/2606.11196#S5.SS4)), cascade evaluation (Section [5.5](https://arxiv.org/html/2606.11196#S5.SS5)), and training dynamics (Section [5.6](https://arxiv.org/html/2606.11196#S5.SS6)). Unless otherwise noted, all judge quality metrics are computed on the held-out test set ($n=300$) with bootstrap 95% confidence intervals.
### 5\.1Judge Model Quality and Cost Tradeoff
Table [5](https://arxiv.org/html/2606.11196#S5.T5) reports judge performance on the held-out test set alongside reference-based evaluator baselines from our prior framework \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\].

Table 5: Judge model comparison on the held-out test set ($n=300$). Pearson and Spearman correlations are computed against the GT proxy. Bootstrap 95% confidence intervals are shown for Pearson $r$. Reference-based baselines require access to a ground-truth answer.

| Model | Params | Pearson ↑ | 95% CI | Spearman ↑ | Latency |
|---|---|---|---|---|---|
| TextCNN Judge | 10M | 0.472 | [0.346, 0.594] | 0.452 | 1.0 ms |
| MiniLM Judge | 22M | 0.676 | [0.567, 0.771] | 0.685 | 13.3 ms |
| DeBERTa Judge | 184M | 0.747 | [0.663, 0.816] | 0.733 | 14.8 ms |
| *Reference-based baselines (require ground-truth answer)* | | | | | |
| STS-paraphrase | 22M | 0.629 | [0.554, 0.693] | 0.683 | ~0.7 ms |
| STS-stsb | 66M | 0.647 | [0.578, 0.711] | 0.698 | ~0.8 ms |
| STS-MiniLM | 22M | 0.629 | [0.558, 0.693] | 0.700 | ~0.7 ms |
| CE-MiniLM | 22M | 0.331 | [0.238, 0.415] | 0.357 | ~0.5 ms |
| CE-DeBERTa | 184M | −0.363 | [−0.480, −0.237] | −0.250 | ~7.3 ms |

Three findings emerge from Table [5](https://arxiv.org/html/2606.11196#S5.T5).
First, the DeBERTa judge achieves the highest GT Pearson correlation (0.747) of any evaluator in the table, including all reference-based baselines. This is notable because the judge operates without access to a reference answer, while the STS baselines compute embedding similarity against the ground-truth response. The confidence interval $[0.663,0.816]$ overlaps the upper end of the interval for the best reference-based evaluator (STS-stsb: $[0.578,0.711]$), so the evidence that the trained judge is at least as strong is moderate rather than conclusive.
Second, the three judge architectures span a clear quality–cost Pareto frontier. TextCNN achieves Pearson 0.472 at roughly 1 ms latency (Table [5](https://arxiv.org/html/2606.11196#S5.T5)), offering a $15\times$ speed advantage over DeBERTa at the cost of a 0.275 correlation gap. MiniLM occupies a balanced position at 0.676 Pearson with 13 ms latency. Figure [3](https://arxiv.org/html/2606.11196#S5.F3) visualizes the Pareto tradeoff.
Third, pre-trained cross-encoders used in reference-based mode (CE-MiniLM, CE-DeBERTa) perform poorly or negatively. CE-DeBERTa achieves $-0.363$ Pearson, confirming that NLI-trained models are not suitable as quality evaluators without task-specific fine-tuning. This validates the need for dedicated judge training.
Figure 3: Left: GT Pearson correlation for all judge models and the best reference-based baseline (dashed line). Right: Quality–cost Pareto frontier for reference-free judges, showing the tradeoff between Pearson correlation and inference latency. DeBERTa dominates in quality; TextCNN dominates in cost.
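The Pareto claim for the three judges can be checked mechanically from the Table 5 numbers. The sketch below is only a worked check: the helper `pareto_front` is illustrative, and the comparison is restricted to the reference-free judges, as in the right panel of Figure 3.

```python
# (name, latency in ms, GT Pearson) for the reference-free judges, from Table 5
judges = [("TextCNN", 1.0, 0.472), ("MiniLM", 13.3, 0.676), ("DeBERTa", 14.8, 0.747)]


def pareto_front(points):
    """Return names of points not dominated in (lower latency, higher Pearson)."""
    front = []
    for name, lat, r in points:
        dominated = any(
            (l2 <= lat and r2 >= r) and (l2 < lat or r2 > r)
            for _, l2, r2 in points
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(judges)   # all three judges are non-dominated
speedup = 14.8 / 1.0           # DeBERTa latency over TextCNN latency
gap = 0.747 - 0.472            # Pearson gap between DeBERTa and TextCNN
```

The latency ratio of 14.8 rounds to the $15\times$ figure quoted in the text, and the Pearson gap is 0.275.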
### 5\.2Task Dependence Analysis
Table [6](https://arxiv.org/html/2606.11196#S5.T6) decomposes judge performance by task type. All three judges exhibit a dramatic gap between QA and summarization performance.

Table 6: Per-task Pearson correlation with GT on the test set ($n=150$ per task). All judges perform substantially better on QA than summarization. Bootstrap 95% CIs are shown.

On QA, the DeBERTa judge reaches Pearson 0.830, with the confidence interval lower bound (0.745) exceeding the point estimate of any reference-based evaluator. MiniLM achieves 0.755, and even TextCNN reaches 0.545. These results indicate that all three architectures successfully learn to assess QA response quality from $(q,y)$ pairs alone.
On summarization, all judges drop sharply: DeBERTa to 0.199, MiniLM to 0.231, TextCNN to 0.119. The TextCNN confidence interval includes zero ($[-0.041,0.287]$), indicating that its summarization scores are not significantly correlated with GT. This task dependence is not unique to our judges. The same GT proxy (token-level F1) was shown in Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\] to be a weak signal for summarization, where semantic coverage and factual consistency matter more than lexical overlap \[[27](https://arxiv.org/html/2606.11196#bib.bib15),[23](https://arxiv.org/html/2606.11196#bib.bib26),[25](https://arxiv.org/html/2606.11196#bib.bib27)\]. Since our judges are trained on this GT proxy (via GPT labels that partially inherit its limitations), the summarization performance reflects a ground-truth quality gap rather than a pure model capacity limitation.
Figure [4](https://arxiv.org/html/2606.11196#S5.F4) visualizes the per-task breakdown.
Figure 4: Judge performance by task type. QA Pearson correlations are strong across all architectures; summarization correlations are weak, reflecting GT proxy limitations for this task.
### 5\.3Reference\-Free vs\. Reference\-Based Composite Scoring
Table [7](https://arxiv.org/html/2606.11196#S5.T7) compares composite scoring modes and individual evaluator baselines on the test set.

Table 7: Composite scoring modes and single-evaluator baselines on the test set ($n=300$). The `ref_free` mode integrates the DeBERTa judge with structural priors; the `full` mode additionally includes reference-based dimensions. Single-dimension GT correlations are shown for context.

| Method | Pearson ↑ | 95% CI | Ref. required? |
|---|---|---|---|
| *Composite scoring modes* | | | |
| Full (ref-based) | 0.380 | [0.277, 0.482] | Yes |
| Ref-free (with judge) | 0.645 | [0.562, 0.718] | No |
| Auto (adaptive) | 0.645 | [0.562, 0.718] | Depends |
| *Best single evaluators (reference-based)* | | | |
| STS-stsb | 0.647 | [0.578, 0.711] | Yes |
| STS-paraphrase | 0.629 | [0.554, 0.693] | Yes |
| STS-MiniLM | 0.629 | [0.558, 0.693] | Yes |
| *Single-dimension GT correlations* | | | |
| Semantic quality | 0.647 | - | Yes |
| Structure quality | 0.368 | - | No |
| Model prior | 0.280 | - | No |
| Query relevance | −0.364 | - | No |
| Consensus agreement | −0.209 | - | Partial |

The central finding is that reference-free composite scoring (Pearson 0.645) matches the best single reference-based evaluator (STS-stsb: 0.647) without requiring any reference answer. The confidence intervals overlap substantially ($[0.562,0.718]$ vs. $[0.578,0.711]$), indicating that the two approaches are statistically indistinguishable at $n=300$. This closes the deployment gap identified in Section [2.2](https://arxiv.org/html/2606.11196#S2.SS2): a PoQ network can now achieve comparable quality assessment without maintaining a reference answer database.
The full composite mode (0.380) performs substantially *worse* than reference-free mode (0.645). This counterintuitive result is explained by the dimension correlation analysis in Table [7](https://arxiv.org/html/2606.11196#S5.T7): two reference-based dimensions, query–output alignment ($-0.293$ Pearson with GT) and consensus agreement ($-0.209$), are negatively correlated with quality on this dataset. When these dimensions are included with positive weights, they drag the composite below the reference-free alternative. This finding is consistent with the calibration failure mode identified in Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\], where alignment and agreement dimensions exhibited negative or task-dependent correlations and required removal or down-weighting.
Figure [5](https://arxiv.org/html/2606.11196#S5.F5) visualizes the scoring mode comparison.
Figure 5: Left: Composite scoring modes compared by GT Pearson. Reference-free mode with the trained judge outperforms the full reference-based composite. Right: Individual scoring method correlations, showing the dominance of STS-based evaluators and the negative impact of cross-encoder and agreement dimensions.
### 5\.4Online Dimension Calibration
Table [8](https://arxiv.org/html/2606.11196#S5.T8) reports calibration results across eight scenarios within the PoQ simulation.

Table 8: Online calibration results. Average inference reward across 5,000 Monte Carlo rounds with $K=3$ evaluators per job. The uncalibrated baseline uses equal weights across all dimensions.

None of the calibration strategies improve average inference reward over the uncalibrated baseline. This result requires careful interpretation. The uncalibrated baseline assigns equal weights to all dimensions, including those that correlate well with the consensus signal used for rewards. Calibration strategies that shift weight toward dimensions more correlated with the *anchor* GT signal can decrease correlation with the consensus signal, reducing rewards within the simulation even when the resulting weight vector is more aligned with true quality. In other words, the calibration objective (GT alignment) and the reward objective (consensus alignment) can diverge.
Despite not improving simulation reward, the gradient calibration strategy produces the most *interpretable* weight vector. Table [9](https://arxiv.org/html/2606.11196#S5.T9) shows the final learned weights.

Table 9: Learned dimension weights from gradient-based calibration with 5% anchor ratio. Initial weight is 1.0 for all dimensions. Semantic quality receives $4.7\times$ the initial weight; structural priors are suppressed to near zero.

The gradient-learned weights are highly consistent with the offline dimension reliability analysis. Semantic quality, the dimension with the highest GT Pearson (0.647) in Table [7](https://arxiv.org/html/2606.11196#S5.T7), receives $4.7\times$ the initial weight, dominating the composite. Dimensions with negative GT correlations (query–output alignment: $-0.293$; consensus agreement: $-0.209$) are correctly suppressed to 0.37 and 0.25, respectively. Structure quality and model prior, which have moderate but noisy positive correlations (0.368, 0.280), are pushed to near zero, suggesting that their contribution is subsumed by the dominant semantic dimension. This demonstrates that online gradient calibration can automatically recover the “remove unreliable dimensions” strategy advocated by the offline ablation analysis in Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\], without requiring manual intervention.
Figure [6](https://arxiv.org/html/2606.11196#S5.F6) visualizes the scenario comparison and learned weights.
Figure 6: Left: Average inference reward across calibration scenarios. Right: Final dimension weights from gradient-based calibration, showing strong concentration on semantic quality and suppression of unreliable dimensions.
### 5\.5Cascade Evaluation
Table [10](https://arxiv.org/html/2606.11196#S5.T10) reports the cascade evaluation results under six budget levels.

Table 10: Cascade evaluation under varying budget constraints. Cost savings are relative to full three-layer evaluation. Two regimes emerge: low-budget (structural only) and high-budget (full evaluation).

Two operating regimes are evident. In the *low-budget regime* ($B\leq 0.10$), the cascade routes nearly all evaluations through Layer 1 (structural priors only), achieving 72.7% cost savings while maintaining GT Pearson $\approx 0.51$. This is a practical operating point for high-throughput scenarios where approximate quality filtering suffices, for example detecting degenerate or clearly low-quality outputs before they enter the reward pipeline.
In the *high-budget regime* ($B\geq 0.20$), most evaluations proceed to Layer 3 (full evaluation), and the cascade provides modest savings (15.7%) with no quality improvement over full evaluation. The transition between regimes is sharp: increasing budget from 0.10 to 0.20 causes average layers to jump from 0.18 to 2.68.
Notably, Layer 2 \(lightweight judge\) is bypassed in all configurations, with evaluations jumping directly from Layer 1 to Layer 3\. This occurs because the current confidence estimator does not accumulate sufficient confidence from the Layer 2 dimensions \(query relevance and judge score\) to avoid triggering Layer 3\. A more refined confidence estimator, or the use of the DeBERTa judge in Layer 2 instead of a default score, would likely improve the cascade’s granularity and is a direction for future work\.
Figure [7](https://arxiv.org/html/2606.11196#S5.F7) visualizes the quality–savings tradeoff.
Figure 7: Left: GT Pearson vs. cost savings across budget levels, showing the low-budget regime ($B\leq 0.10$) as a favorable operating point. Right: GT Pearson and savings by budget level, illustrating the sharp transition between the two regimes.
### 5\.6Training Dynamics
Figure [8](https://arxiv.org/html/2606.11196#S5.F8) shows validation loss and Pearson correlation during fine-tuning for all three architectures.

Figure 8: Fine-tuning dynamics across judge architectures. Left: validation loss. Right: validation Pearson correlation. MiniLM converges fastest; DeBERTa achieves peak performance at epoch 10 before mild overfitting; TextCNN requires the full 20 epochs.

Table 11: Training configuration and outcomes for each judge architecture. Pre-training on UltraFeedback (45k train / 5k val) is followed by fine-tuning on domain data (1,400 train / 300 val).

Three observations stand out. First, MiniLM exhibits the fastest convergence, reaching near-peak validation Pearson ($\sim 0.70$) within 3–5 fine-tuning epochs. This is consistent with the cross-encoder backbone being pre-trained on semantic similarity tasks that share structural similarities with quality assessment. Second, DeBERTa achieves its best validation loss at epoch 10 (val Pearson 0.722) but exhibits mild overfitting thereafter, with validation loss increasing from 1.001 to 1.203 by epoch 15. The early-stopping checkpoint (epoch 10) is used for all reported results. Third, TextCNN requires the full 20 fine-tuning epochs to converge, consistent with its parameters being trained from scratch rather than fine-tuned from a language-model initialization.
An interesting discrepancy appears between validation and test performance: DeBERTa’s test Pearson (0.747) exceeds its best validation Pearson (0.722), while MiniLM’s test Pearson (0.676) falls below its validation Pearson (0.740). This variability is expected given the moderate sample sizes ($n=300$) and is captured by the bootstrap confidence intervals reported in Table [5](https://arxiv.org/html/2606.11196#S5.T5).
## 6Discussion
Our results demonstrate that trained reference\-free judges can close the deployment gap between offline multi\-dimensional scoring and live decentralized inference\. This section discusses the main implications, practical guidance, and limitations\.
#### Reference\-free evaluation is viable for PoQ\.
The DeBERTa judge achieves Pearson 0\.747 with GT on the held\-out test set, exceeding all reference\-based evaluators from our prior framework \(best: STS\-stsb at 0\.647\)\. When integrated into composite scoring, the reference\-free mode attains 0\.645—statistically indistinguishable from the best single reference\-based evaluator\. This result validates the core hypothesis of this paper: the quality assessment needed for PoQ incentives can be performed without reference answers, using compact encoder models trained via a two\-stage pipeline\. For decentralized deployments, this eliminates the requirement to maintain and distribute reference answer databases, substantially simplifying the evaluation infrastructure\.
#### Task dependence is the primary open challenge\.
The most striking finding is the dramatic gap between QA \(Pearson 0\.830\) and summarization \(0\.199\) for all judge architectures\. We attribute this primarily to the ground\-truth proxy: token\-level F1 is a natural measure for extractive QA, where correct answers are short spans with well\-defined token overlap, but it poorly captures summarization quality, where semantic coverage, faithfulness, and fluency matter more than lexical overlap\[[27](https://arxiv.org/html/2606.11196#bib.bib15),[23](https://arxiv.org/html/2606.11196#bib.bib26),[25](https://arxiv.org/html/2606.11196#bib.bib27)\]\. Since our GPT\-labeled training data partially inherits the characteristics of this proxy through the task distribution, judges trained on this signal cannot be expected to transcend its limitations\.
This observation has two implications. First, improving the GT proxy for summarization (for example, by incorporating ROUGE variants, factual consistency scores \[[23](https://arxiv.org/html/2606.11196#bib.bib26)\], or dedicated summarization evaluators \[[25](https://arxiv.org/html/2606.11196#bib.bib27)\]) would likely improve judge training and evaluation for this task family. Second, task-specific judge training or task-conditional scoring may be necessary for deployments spanning diverse task distributions, echoing the task-aware calibration strategies advocated in Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\].
#### Architecture selection depends on deployment context\.
The $15\times$ latency gap between TextCNN (~1 ms) and DeBERTa (~15 ms) creates a natural tiering strategy for PoQ deployments:
- High-throughput / cost-sensitive: TextCNN as a coarse filter. At Pearson 0.472, it can separate clearly good from clearly poor outputs at minimal cost, suitable for high-volume pre-screening or as the lightweight tier in a cascade.
- Balanced: MiniLM at Pearson 0.676 and 13 ms offers a practical default for most evaluation rounds, providing strong quality assessment without the memory footprint of DeBERTa (87 MB vs. 702 MB checkpoint).
- Quality-critical: DeBERTa at Pearson 0.747 for high-stakes evaluations or when evaluation budget is sufficient. The 15 ms latency is well within typical PoQ round budgets of hundreds of milliseconds.

The cascade evaluation protocol (Section [5.5](https://arxiv.org/html/2606.11196#S5.SS5)) provides a principled mechanism for combining these tiers, although the current implementation would benefit from a more refined confidence estimator that better utilizes Layer 2 (lightweight judge) before escalating to full evaluation.
#### Gradient calibration recovers offline insights automatically\.
Online gradient-based weight learning (Table [9](https://arxiv.org/html/2606.11196#S5.T9)) produces a dimension ranking that closely matches the offline reliability analysis from Tian et al. \[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]: semantic quality is identified as the dominant signal ($4.7\times$ initial weight), while negatively correlated dimensions (alignment, agreement) are suppressed. This demonstrates that principled online calibration can substitute for manual ablation studies, which is important for production systems where the task distribution may evolve and offline analysis cannot be performed continuously.
The observation that EMA\-based calibration drives all weights to the upper bound \(5\.0\) regardless of anchor ratio indicates that EMA agreement tracking is insufficiently discriminative for this setting\. This is consistent with the relatively high dimension–consensus agreement even for unreliable dimensions: when the consensus itself is noisy, most dimensions appear “agreeable” by this measure\. Gradient calibration avoids this pitfall by directly optimizing prediction error against anchor signals, producing meaningful differentiation\.
#### The calibration–reward divergence\.
An important subtlety is that improved GT alignment does not automatically translate to improved simulation reward\. In our experiments, the uncalibrated baseline achieves the highest average reward \(0\.5326\) despite having lower GT alignment than the gradient\-calibrated variant \(0\.5186\)\. This occurs because PoQ rewards are based on consensus alignment rather than GT alignment: a composite that happens to correlate well with the consensus signal \(even if the consensus is imperfect\) will produce higher rewards than one calibrated toward a more “correct” but consensus\-divergent target\. In practice, this suggests that calibration is most valuable when the system designer can periodically inject high\-quality anchor signals and is willing to accept short\-term reward reductions in exchange for longer\-term signal quality improvements\.
#### Limitations\.
Several limitations should be noted\. First, the GT proxy \(token\-level F1\) is a significant bottleneck for summarization evaluation, as discussed above\. Second, GPT\-labeled training data introduces teacher model bias: judges inherit any systematic preferences or blind spots of the GPT\-4o\-mini labeler\[[7](https://arxiv.org/html/2606.11196#bib.bib25),[46](https://arxiv.org/html/2606.11196#bib.bib10)\]\. Third, the held\-out test set \(n=300\) limits statistical power, particularly for per\-task and per\-model analyses where subgroup sizes are 55–150\. Fourth, all experiments use a single domain \(SQuAD \+ CNN/DailyMail\); generalization to instruction\-following, creative writing, or code generation tasks is untested\. Fifth, the current cascade implementation does not fully utilize the trained judge in Layer 2 \(the judge dimension returns a default score when not properly initialized with a loaded model\), which should be addressed in future implementations\. Finally, adversarial robustness of the judges themselves \(e\.g\., susceptibility to adversarial outputs designed to inflate judge scores\) is not studied here and represents an important direction for future work, building on the threat models established in Tian et al\.\[[41](https://arxiv.org/html/2606.11196#bib.bib2)\]\.
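To give a rough sense of the statistical\-power limitation, the sketch below computes an approximate 95% confidence interval for a Pearson correlation with the Fisher z\-transform at several sample sizes\. It assumes, hypothetically, that the same correlation of 0\.747 held in every subgroup and that the usual bivariate\-normal approximation applies; it is not the procedure behind the interval reported for the full test set, so the numbers need not match\.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)  # standard error of z is 1 / sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


r = 0.747  # correlation reported for the DeBERTa judge on the full test set
for n in (55, 150, 300):  # subgroup-size range from the text, plus the full test set
    lo, hi = fisher_ci(r, n)
    print(f"n={n:3d}: [{lo:.3f}, {hi:.3f}] width={hi - lo:.3f}")
```

Under these assumptions the interval widens markedly as the sample shrinks from 300 to 55, which is why per\-task and per\-model correlations should be read with more caution than the aggregate figure\.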
## 7 Related Work
#### LLM\-as\-Judge and learned evaluators\.
Using language models as evaluators has gained prominence through LLM\-as\-a\-Judge frameworks, where strong models such as GPT\-4 are prompted to assess output quality, achieving notable human correlation on benchmarks like MT\-Bench\[[46](https://arxiv.org/html/2606.11196#bib.bib10),[28](https://arxiv.org/html/2606.11196#bib.bib24)\]\. Chatbot Arena provides a large\-scale human preference platform with associated ranking methodologies\[[8](https://arxiv.org/html/2606.11196#bib.bib11),[16](https://arxiv.org/html/2606.11196#bib.bib39)\]\. However, LLM judges exhibit systematic biases including position preference, verbosity preference, and self\-enhancement\[[7](https://arxiv.org/html/2606.11196#bib.bib25)\]\. Prometheus and its successor demonstrate that open\-source models can be fine\-tuned for evaluation\[[20](https://arxiv.org/html/2606.11196#bib.bib21),[21](https://arxiv.org/html/2606.11196#bib.bib22)\], using rubric\-based prompting and AI feedback data such as UltraFeedback\[[9](https://arxiv.org/html/2606.11196#bib.bib23)\]\. Our work differs in targeting much smaller models \(10M–184M vs\. billion\-parameter evaluators\) optimized for the latency and cost constraints of decentralized PoQ evaluation rather than general\-purpose LLM assessment\.
#### Automatic evaluation metrics\.
Classic overlap\-based metrics such as BLEU and ROUGE remain widely deployed but are known to be brittle proxies for human judgment, particularly for open\-ended generation\[[31](https://arxiv.org/html/2606.11196#bib.bib16),[27](https://arxiv.org/html/2606.11196#bib.bib15)\]\. Embedding\-based metrics improve correlation by operating in learned representation spaces: BERTScore uses contextual token embeddings\[[43](https://arxiv.org/html/2606.11196#bib.bib13)\], MoverScore uses earth mover distance over embeddings\[[45](https://arxiv.org/html/2606.11196#bib.bib42)\], and BLEURT trains a regression model on human ratings\[[37](https://arxiv.org/html/2606.11196#bib.bib14)\]\. COMET further demonstrates that learned metrics trained with human post\-edits achieve strong performance in machine translation evaluation\[[35](https://arxiv.org/html/2606.11196#bib.bib41)\]\. For summarization, specialized evaluators target factual consistency via NLI\-based models\[[23](https://arxiv.org/html/2606.11196#bib.bib26),[25](https://arxiv.org/html/2606.11196#bib.bib27)\]\. Sentence\-level representations from Sentence\-BERT\[[36](https://arxiv.org/html/2606.11196#bib.bib17)\] and SimCSE\[[13](https://arxiv.org/html/2606.11196#bib.bib18)\] underpin many of these approaches\. Our judges share the learned\-metric philosophy but are trained end\-to\-end for reference\-free quality regression rather than adapted from pre\-trained similarity or entailment models\.
#### Lightweight text encoders\.
TextCNN\[[22](https://arxiv.org/html/2606.11196#bib.bib38)\] demonstrated that simple convolutional architectures over word embeddings can be competitive for text classification, motivating our ultra\-lightweight judge tier\. DeBERTa introduced disentangled attention over content and position\[[15](https://arxiv.org/html/2606.11196#bib.bib19)\], with DeBERTaV3 incorporating ELECTRA\-style pre\-training\[[14](https://arxiv.org/html/2606.11196#bib.bib20)\]; we use DeBERTa\-v3\-base as our highest\-quality judge backbone\. The multi\-dimensional quality assessment paradigm has a long history in translation evaluation, where frameworks such as MQM decompose quality into interpretable categories\[[29](https://arxiv.org/html/2606.11196#bib.bib40)\]; our composite scoring framework extends this principle to decentralized LLM inference\.
#### PoQ and decentralized inference\.
Proof of Quality uses lightweight evaluators to produce a network\-level quality signal for decentralized inference, replacing costly cryptographic verification\[[44](https://arxiv.org/html/2606.11196#bib.bib4),[32](https://arxiv.org/html/2606.11196#bib.bib31),[3](https://arxiv.org/html/2606.11196#bib.bib30)\]\. Collaborative inference systems such as Petals demonstrate the feasibility of distributed LLM serving\[[5](https://arxiv.org/html/2606.11196#bib.bib5)\], while efficient serving techniques address throughput and memory bottlenecks\[[24](https://arxiv.org/html/2606.11196#bib.bib28),[10](https://arxiv.org/html/2606.11196#bib.bib29)\]\. Our prior work developed cost\-aware PoQ\[[39](https://arxiv.org/html/2606.11196#bib.bib1)\], adaptive robust PoQ with Byzantine\-resilient aggregation and trust weighting\[[41](https://arxiv.org/html/2606.11196#bib.bib2)\], and multi\-dimensional quality scoring\[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\. The present paper extends this line by introducing trained reference\-free judges that close the deployment gap identified in the multi\-dimensional framework\.
#### Robust aggregation and trust mechanisms\.
Byzantine\-fault\-tolerant consensus\[[6](https://arxiv.org/html/2606.11196#bib.bib8)\] and Byzantine\-robust distributed learning\[[4](https://arxiv.org/html/2606.11196#bib.bib6),[42](https://arxiv.org/html/2606.11196#bib.bib7),[12](https://arxiv.org/html/2606.11196#bib.bib9)\] provide foundational principles for tolerating adversarial participants\. Crowdsourcing reliability estimation, including the Dawid–Skene model\[[11](https://arxiv.org/html/2606.11196#bib.bib32)\] and learning\-from\-crowds frameworks\[[34](https://arxiv.org/html/2606.11196#bib.bib33)\], addresses annotator heterogeneity in settings analogous to our evaluator pool\. Reputation systems such as EigenTrust\[[19](https://arxiv.org/html/2606.11196#bib.bib43)\] and peer prediction\[[30](https://arxiv.org/html/2606.11196#bib.bib44)\] offer complementary mechanisms for assessing participant reliability in decentralized networks\. Federated learning shares concerns of participant heterogeneity and adversarial manipulation\[[18](https://arxiv.org/html/2606.11196#bib.bib34),[2](https://arxiv.org/html/2606.11196#bib.bib35)\], though our setting involves evaluator scoring rather than model training\.
#### Online learning and adaptive calibration\.
Online convex optimization\[[38](https://arxiv.org/html/2606.11196#bib.bib46)\] provides the theoretical foundation for our gradient\-based dimension calibration, while multi\-armed bandit algorithms\[[1](https://arxiv.org/html/2606.11196#bib.bib45)\] motivate the UCB\-style dimension selection strategy\. Holistic evaluation frameworks such as HELM\[[26](https://arxiv.org/html/2606.11196#bib.bib12)\] advocate decomposing model assessment into multiple axes, a philosophy we operationalize for decentralized inference with online adaptability\.
## 8 Conclusion
We introduced PoQ\-Judge, a multi\-architecture reference\-free evaluation framework for decentralized LLM inference under Proof of Quality\. By training three dedicated judge models—TextCNN \(10M parameters\), MiniLM \(22M\), and DeBERTa \(184M\)—via a two\-stage pipeline transferring evaluation knowledge from large\-scale AI feedback to the PoQ task distribution, we close the deployment gap between offline quality analysis and live decentralized serving\.
Our experiments demonstrate several key findings\. The DeBERTa judge achieves Pearson correlation 0\.747 with the ground\-truth quality proxy on a held\-out test set, exceeding all reference\-based evaluators from our prior multi\-dimensional framework\[[40](https://arxiv.org/html/2606.11196#bib.bib3)\]\. Reference\-free composite scoring \(Pearson 0\.645\) matches the best single reference\-based evaluator without requiring reference answers, validating the viability of trained judges for PoQ deployment\. Online gradient\-based calibration automatically identifies semantic quality as the dominant dimension while suppressing unreliable signals, recovering the insights of manual ablation analysis\. The cascade evaluation protocol achieves up to 72\.7% cost savings in a low\-budget operating regime using structural priors alone\. At the same time, we observe sharp task dependence—QA Pearson reaches 0\.830 while summarization drops to 0\.199—attributable primarily to limitations of the token\-level F1 ground\-truth proxy for summarization evaluation\.
Together with our prior work on cost\-aware PoQ\[[39](https://arxiv.org/html/2606.11196#bib.bib1)\], adversarial robustness\[[41](https://arxiv.org/html/2606.11196#bib.bib2)\], and multi\-dimensional scoring\[[40](https://arxiv.org/html/2606.11196#bib.bib3)\], PoQ\-Judge provides a complete pipeline from quality measurement through incentive allocation for decentralized LLM inference\.
#### Future work\.
Several directions merit investigation\. Task\-specific ground\-truth proxies incorporating ROUGE, factual consistency, and semantic similarity would improve summarization evaluation and judge training\. Larger and more diverse training sets, including instruction\-following and code generation tasks, would test generalization\. Adaptive architecture selection—automatically routing evaluations to the appropriate judge tier based on predicted difficulty—would extend the cascade protocol\. Adversarial robustness of the judges themselves, including resistance to outputs crafted to inflate quality scores, is an important deployment concern\. Finally, integrating PoQ\-Judge with Sybil\-resistant identity mechanisms would address a known limitation of decentralized consensus systems where attackers can scale their influence through multiple identities\[[41](https://arxiv.org/html/2606.11196#bib.bib2),[18](https://arxiv.org/html/2606.11196#bib.bib34)\]\.
## References
- \[1\]P\. Auer, N\. Cesa\-Bianchi, and P\. Fischer\(2002\)Finite\-time analysis of the multiarmed bandit problem\.Machine Learning47\(2–3\),pp\. 235–256\.Cited by:[§3\.4](https://arxiv.org/html/2606.11196#S3.SS4.SSS0.Px2.p3.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px6.p1.1)\.
- \[2\]E\. Bagdasaryan, A\. Veit, Y\. Hua, D\. Estrin, and V\. Shmatikov\(2020\)How to backdoor federated learning\.InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics,Vol\.108,pp\. 2938–2948\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[3\]E\. Ben\-Sasson, A\. Chiesa, D\. Genkin, E\. Tromer, and M\. Virza\(2014\)Succinct non\-interactive arguments for a von Neumann architecture\.In23rd USENIX Security Symposium,pp\. 781–796\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px1.p1.9),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[4\]P\. Blanchard, E\. M\. El Mhamdi, R\. Guerraoui, and J\. Stainer\(2017\)Machine learning with adversaries: Byzantine tolerant gradient descent\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px3.p1.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[5\]A\. Borzunov, D\. Baranchuk, T\. Dettmers, M\. Riabinin, Y\. Belkada, A\. Chumachenko, P\. Samygin, and C\. Raffel\(2023\)Petals: collaborative inference and fine\-tuning of large models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),pp\. 558–568\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[6\]M\. Castro and B\. Liskov\(1999\)Practical Byzantine fault tolerance\.InProceedings of the Third Symposium on Operating Systems Design and Implementation \(OSDI ’99\),pp\. 173–186\.Cited by:[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px3.p1.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[7\]G\. H\. Chen, S\. Chen, Z\. Liu, F\. Jiang, and B\. Wang\(2024\)Humans or LLMs as the judge? a study on judgement bias\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 8301–8327\.Cited by:[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px6.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[8\]W\. Chiang, L\. Zheng, Y\. Sheng,et al\.\(2024\)Chatbot arena: an open platform for evaluating LLMs by human preference\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Cited by:[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p4.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[9\]G\. Cui, L\. Yuan, N\. Ding, G\. Yao, B\. He, W\. Zhu, Y\. Ni, G\. Xie, R\. Xie, Y\. Lin, Z\. Liu, and M\. Sun\(2024\)UltraFeedback: boosting language models with scaled AI feedback\.arXiv preprint arXiv:2310\.01377\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p5.2),[§3\.2](https://arxiv.org/html/2606.11196#S3.SS2.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[10\]T\. Dao, D\. Y\. Fu, S\. Ermon, A\. Rudra, and C\. Ré\(2022\)FlashAttention: fast and memory\-efficient exact attention with IO\-awareness\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[11\]A\. P\. Dawid and A\. M\. Skene\(1979\)Maximum likelihood estimation of observer error\-rates using the EM algorithm\.Journal of the Royal Statistical Society\. Series C \(Applied Statistics\)28\(1\),pp\. 20–28\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[12\]E\. M\. El Mhamdi, R\. Guerraoui, and S\. Rouault\(2018\)The hidden vulnerability of distributed learning in Byzantium\.InProceedings of the 35th International Conference on Machine Learning,Cited by:[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px3.p1.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[13\]T\. Gao, X\. Yao, and D\. Chen\(2021\)SimCSE: simple contrastive learning of sentence embeddings\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 6894–6910\.Cited by:[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p1.3),[§4\.3](https://arxiv.org/html/2606.11196#S4.SS3.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[14\]P\. He, J\. Gao, and W\. Chen\(2023\)DeBERTaV3: improving DeBERTa using ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing\.InInternational Conference on Learning Representations,Cited by:[3rd item](https://arxiv.org/html/2606.11196#S1.I1.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.11196#S3.SS1.SSS0.Px3.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px3.p1.1)\.
- \[15\]P\. He, X\. Liu, J\. Gao, and W\. Chen\(2021\)DeBERTa: decoding\-enhanced BERT with disentangled attention\.InInternational Conference on Learning Representations,Cited by:[3rd item](https://arxiv.org/html/2606.11196#S1.I1.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.11196#S3.SS1.SSS0.Px3.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px3.p1.1)\.
- \[16\]R\. Herbrich, T\. Minka, and T\. Graepel\(2007\)TrueSkill™: a Bayesian skill rating system\.InAdvances in Neural Information Processing Systems,Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[17\]K\. M\. Hermann, T\. Kočiský, E\. Grefenstette, L\. Espeholt, W\. Kay, M\. Suleyman, and P\. Blunsom\(2015\)Teaching machines to read and comprehend\.InAdvances in Neural Information Processing Systems,Vol\.28\.Cited by:[§3\.2](https://arxiv.org/html/2606.11196#S3.SS2.SSS0.Px2.p1.1),[2nd item](https://arxiv.org/html/2606.11196#S4.I1.i2.p1.1)\.
- \[18\]P\. Kairouz, H\. B\. McMahan, B\. Avent, A\. Bellet, M\. Bennis, A\. N\. Bhagoji,et al\.\(2021\)Advances and open problems in federated learning\.Foundations and Trends in Machine Learning14\(1–2\),pp\. 1–210\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1),[§8](https://arxiv.org/html/2606.11196#S8.SS0.SSS0.Px1.p1.1)\.
- \[19\]S\. D\. Kamvar, M\. T\. Schlosser, and H\. Garcia\-Molina\(2003\)The EigenTrust algorithm for reputation management in P2P networks\.InProceedings of the 12th International Conference on World Wide Web,pp\. 640–651\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[20\]S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne,et al\.\(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[21\]S\. Kim, J\. Suk, S\. Longpre, B\. Y\. Lin, J\. Shin, S\. Welleck, G\. Neubig, M\. Lee, K\. Lee, and M\. Seo\(2024\)Prometheus 2: an open source language model specialized in evaluating other language models\.arXiv preprint arXiv:2405\.01535\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[22\]Y\. Kim\(2014\)Convolutional neural networks for sentence classification\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing,pp\. 1746–1751\.Cited by:[1st item](https://arxiv.org/html/2606.11196#S1.I1.i1.p1.1),[§3\.1](https://arxiv.org/html/2606.11196#S3.SS1.SSS0.Px1.p1.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px3.p1.1)\.
- \[23\]W\. Kryscinski, B\. McCann, C\. Xiong, and R\. Socher\(2020\)Evaluating the factual consistency of abstractive text summarization\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,pp\. 9332–9346\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p7.2),[§4\.1](https://arxiv.org/html/2606.11196#S4.SS1.SSS0.Px1.p1.2),[§5\.2](https://arxiv.org/html/2606.11196#S5.SS2.p3.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p2.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[24\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with PagedAttention\.InProceedings of the 29th Symposium on Operating Systems Principles \(SOSP ’23\),Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[25\]P\. Laban, T\. Schnabel, P\. N\. Bennett, and M\. A\. Hearst\(2022\)SummaC: re\-visiting NLI\-based models for inconsistency detection in summarization\.Transactions of the Association for Computational Linguistics10,pp\. 163–177\.Cited by:[§5\.2](https://arxiv.org/html/2606.11196#S5.SS2.p3.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p2.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[26\]P\. Liang, R\. Bommasani,et al\.\(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.Cited by:[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p4.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px6.p1.1)\.
- \[27\]C\. Lin\(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,pp\. 74–81\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p7.2),[§4\.1](https://arxiv.org/html/2606.11196#S4.SS1.SSS0.Px1.p1.2),[§5\.2](https://arxiv.org/html/2606.11196#S5.SS2.p3.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[28\]Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu\(2023\)G\-Eval: NLG evaluation using GPT\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 2511–2522\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.
- \[29\]A\. R\. Lommel, A\. Burchardt, and H\. Uszkoreit\(2013\)Multidimensional quality metrics: a flexible system for assessing translation quality\.InProceedings of Translating and the Computer 35,Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px3.p1.1)\.
- \[30\]N\. Miller, P\. Resnick, and R\. Zeckhauser\(2005\)Eliciting informative feedback: the peer\-prediction method\.Management Science51\(9\),pp\. 1359–1373\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[31\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\)BLEU: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,pp\. 311–318\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[32\]B\. Parno, J\. Howell, C\. Gentry, and M\. Raykova\(2013\)Pinocchio: nearly practical verifiable computation\.In2013 IEEE Symposium on Security and Privacy,Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px1.p1.9),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[33\]P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang\(2016\)SQuAD: 100,000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,pp\. 2383–2392\.Cited by:[§3\.2](https://arxiv.org/html/2606.11196#S3.SS2.SSS0.Px2.p1.1),[1st item](https://arxiv.org/html/2606.11196#S4.I1.i1.p1.1)\.
- \[34\]V\. C\. Raykar, S\. Yu, L\. H\. Zhao, G\. H\. Valadez, C\. Florin, L\. Bogoni, and L\. Moy\(2010\)Learning from crowds\.Journal of Machine Learning Research11,pp\. 1297–1322\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[35\]R\. Rei, C\. Stewart, A\. C\. Farinha, and A\. Lavie\(2020\)COMET: a neural framework for MT evaluation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,pp\. 2685–2702\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[36\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 3982–3992\.Cited by:[2nd item](https://arxiv.org/html/2606.11196#S1.I1.i2.p1.1),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p1.3),[§3\.1](https://arxiv.org/html/2606.11196#S3.SS1.SSS0.Px2.p1.2),[§4\.3](https://arxiv.org/html/2606.11196#S4.SS3.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[37\]T\. Sellam, D\. Das, and A\. Parikh\(2020\)BLEURT: learning robust metrics for text generation\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 7881–7892\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[38\]S\. Shalev\-Shwartz\(2012\)Online learning and online convex optimization\.Foundations and Trends in Machine Learning4\(2\),pp\. 107–194\.Cited by:[§3\.4](https://arxiv.org/html/2606.11196#S3.SS4.SSS0.Px2.p4.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px6.p1.1)\.
- \[39\]A\. Tian, A\. Ding, F\. Chen, A\. Wu, A\. Chan, and B\. Zhang\(2025\)Design and evaluation of cost\-aware PoQ for decentralized LLM inference\.arXiv preprint arXiv:2512\.16317\.External Links:[Link](https://arxiv.org/abs/2512.16317)Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px2.p1.4),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[Table 1](https://arxiv.org/html/2606.11196#S2.T1.1.2.1.1.1.1),[§2](https://arxiv.org/html/2606.11196#S2.p1.1),[§3\.5](https://arxiv.org/html/2606.11196#S3.SS5.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.11196#S4.SS1.SSS0.Px1.p1.2),[§4\.2](https://arxiv.org/html/2606.11196#S4.SS2.p1.1),[§4\.4](https://arxiv.org/html/2606.11196#S4.SS4.SSS0.Px5.p1.2),[§4](https://arxiv.org/html/2606.11196#S4.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1),[§8](https://arxiv.org/html/2606.11196#S8.p3.1)\.
- \[40\]A\. Tian, A\. Ding, F\. Chen, S\. Wu, and A\. Chan\(2026\)A multi\-dimensional quality scoring framework for decentralized LLM inference with proof of quality\.arXiv preprint arXiv:2603\.04028\.External Links:[Link](https://arxiv.org/abs/2603.04028)Cited by:[3rd item](https://arxiv.org/html/2606.11196#S1.I2.i3.p1.1),[§1](https://arxiv.org/html/2606.11196#S1.p2.1),[§1](https://arxiv.org/html/2606.11196#S1.p3.1),[§1](https://arxiv.org/html/2606.11196#S1.p5.2),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px4.p1.2),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p1.3),[Table 1](https://arxiv.org/html/2606.11196#S2.T1.1.4.3.1.1.1),[§2](https://arxiv.org/html/2606.11196#S2.p1.1),[§3\.3](https://arxiv.org/html/2606.11196#S3.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.11196#S4.SS1.SSS0.Px1.p1.2),[§4\.3](https://arxiv.org/html/2606.11196#S4.SS3.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.11196#S4.p1.1),[§5\.1](https://arxiv.org/html/2606.11196#S5.SS1.p1.1),[§5\.2](https://arxiv.org/html/2606.11196#S5.SS2.p3.1),[§5\.3](https://arxiv.org/html/2606.11196#S5.SS3.p3.2),[§5\.4](https://arxiv.org/html/2606.11196#S5.SS4.p4.3),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px2.p2.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px4.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1),[§8](https://arxiv.org/html/2606.11196#S8.p2.1),[§8](https://arxiv.org/html/2606.11196#S8.p3.1)\.
- \[41\]A\. Tian, A\. Ding, F\. Chen, S\. Wu, and A\. Chan\(2026\)Adaptive and robust cost\-aware proof of quality for decentralized LLM inference networks\.arXiv preprint arXiv:2601\.21189\.External Links:[Link](https://arxiv.org/abs/2601.21189)Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.11196#S2.T1.1.3.2.1.1.1),[§2](https://arxiv.org/html/2606.11196#S2.p1.1),[§4\.4](https://arxiv.org/html/2606.11196#S4.SS4.SSS0.Px5.p1.2),[§4](https://arxiv.org/html/2606.11196#S4.p1.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px6.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1),[§8](https://arxiv.org/html/2606.11196#S8.SS0.SSS0.Px1.p1.1),[§8](https://arxiv.org/html/2606.11196#S8.p3.1)\.
- \[42\]D\. Yin, Y\. Chen, R\. Kannan, and P\. Bartlett\(2018\)Byzantine\-robust distributed learning: towards optimal statistical rates\.InProceedings of the 35th International Conference on Machine Learning,Vol\.80,pp\. 5650–5659\.Cited by:[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px3.p1.2),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px5.p1.1)\.
- \[43\]T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi\(2020\)BERTscore: evaluating text generation with BERT\.InInternational Conference on Learning Representations,Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[44\]Z\. Zhang, Y\. Rao, H\. Xiao, X\. Xiao, and Y\. Yang\(2024\)Proof of quality: a costless paradigm for trustless generative AI model inference on blockchains\.arXiv preprint arXiv:2405\.17934\.External Links:[Link](https://arxiv.org/abs/2405.17934)Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.11196#S2.SS1.SSS0.Px1.p1.9),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px4.p1.1)\.
- \[45\]W\. Zhao, M\. Peyrard, F\. Liu, Y\. Gao, C\. M\. Meyer, and S\. Eger\(2019\)MoverScore: text generation evaluating with contextualized embeddings and earth mover distance\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 563–578\.Cited by:[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px2.p1.1)\.
- \[46\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 46595–46623\.Cited by:[§1](https://arxiv.org/html/2606.11196#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.11196#S2.SS2.p3.1),[§6](https://arxiv.org/html/2606.11196#S6.SS0.SSS0.Px6.p1.1),[§7](https://arxiv.org/html/2606.11196#S7.SS0.SSS0.Px1.p1.1)\.