@omarsar0: If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use L…
Summary
BinEval is a new framework that decomposes LLM evaluation criteria into atomic binary questions, improving interpretability and enabling targeted prompt optimization, achieving strong results on factual consistency benchmarks.
If you use LLM-as-judge, this one is worth reading.
(bookmark it)
It’s actually one of the most effective ways to use LLM-as-a-Judge for evals.
Holistic judge scores hide both their reasoning and their ceiling effects.
BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores.
Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal.
Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval, training-free, with especially strong results on factual consistency.
Paper: https://arxiv.org/abs/2606.27226
Learn to build effective AI agents in our academy: https://academy.dair.ai
Ask, Don’t Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement
Source: https://arxiv.org/html/2606.27226
Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, Sambit Sahu
Abstract
Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BinEval, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BinEval matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BinEval better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BinEval provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.
large language models, evaluation, prompt optimization, interpretability
1 Introduction
The rapid progress of large language models (LLMs) has made generation easy and evaluation hard. Modern systems can produce fluent, contextually appropriate outputs across tasks such as summarization, dialogue, reasoning, and instruction following, but evaluating those outputs remains a major bottleneck. Human evaluation is slow and expensive, lexical metrics such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), and BERTScore (Zhang et al., 2020) miss semantic correctness and factuality, and holistic LLM judges (Zheng et al., 2023; Liu et al., 2023) often return opaque scores that are difficult to diagnose.
This bottleneck is especially costly in iterative development. Comparing prompts, models, or decoding strategies requires feedback that is not only accurate but also actionable. A single scalar score is often insufficient: if a summary receives a mediocre rating, it is still unclear whether the problem is factual inconsistency, weak relevance, missing content, or poor fluency.
Our premise is simple: instead of asking a model for one broad judgment, ask it a set of small, checkable questions. We therefore propose BinEval, which decomposes each evaluation criterion into atomic yes/no questions and aggregates the resulting verdicts into interpretable scores. This decomposition turns evaluation from a black-box verdict into a structured diagnostic signal, making it easier to inspect, debug, and improve both evaluators and generators.
BinEval has three components. First, a meta-prompt decomposes a task prompt into atomic questions organized by evaluation dimension. Second, an evaluator answers each question independently and aggregates the answers into per-dimension and overall scores. Third, a two-phase optimization loop improves both evaluator prompts and generation prompts using question-level feedback.
We evaluate BinEval on SummEval (Fabbri et al., 2021), Topical-Chat (Mehri and Eskenazi, 2020), and QAGS (Wang et al., 2020), and we study iterative prompt updating on summarization and IFBench.
Our contributions are:
- A general framework for interpretable evaluation. We decompose evaluation criteria into atomic yes/no questions, yielding a task-agnostic and modular method.
- Strong performance without task-specific training. BinEval matches or exceeds trained evaluators and holistic LLM judges on SummEval, Topical-Chat, and QAGS.
- Iterative prompt improvement. We introduce a two-phase optimization loop that improves prompts for both summarization and IFBench.
- Debuggable scores. Each BinEval score is grounded in individual verdicts with explanations, making evaluator behavior easier to inspect and diagnose.
2 Related Work
Traditional Evaluation Metrics. Lexical overlap metrics, namely ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005), remain standard for summarization and translation evaluation, but they often struggle to capture semantic equivalence in open-ended generation. Embedding-based metrics such as BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019) improve semantic matching by operating in representation space, while generation-based metrics like BARTScore (Yuan et al., 2021) frame evaluation as text generation. More recent reference-free methods like ParaPLUIE (Lemesle et al., 2025) measure meaning preservation using model perplexity without requiring gold references, and frameworks like OmniScore (Alam et al., 2026) use deterministic learned evaluators to support scalable multilingual assessment.
LLM-as-Judge. Recent work has increasingly leveraged LLMs themselves as evaluators. G-Eval (Liu et al., 2023) uses chain-of-thought reasoning followed by a Likert-scale rating, while AlpacaEval (Li et al., 2023) and MT-Bench / Chatbot Arena (Zheng et al., 2023) rely on pairwise or preference-based judgments. The paradigm has also expanded to specialized open-source evaluators such as Prometheus 2 (Kim et al., 2024), which approximates the depth of human and proprietary model judgments. However, these judges remain susceptible to position, verbosity, and self-enhancement biases (Zheng et al., 2023). Recent benchmarks like JudgeBiasBench (Zhou et al., 2026) further systematize these concerns by providing a taxonomy of judge biases and proposing debiasing strategies.
Multi-Dimensional Evaluation. Multi-dimensional evaluation aims to decompose quality into interpretable facets such as coherence, faithfulness, informativeness, and relevance. UniEval (Zhong et al., 2022) is a key prior example: it reformulates evaluation as Boolean question answering and fine-tunes a T5-based evaluator for multiple dimensions. More recent work similarly decomposes evaluation into facets like informativeness and faithfulness (Alam et al., 2026), while hybrid frameworks such as QAEval (Yue et al., 2025) combine rule-based reliability with a Mixture of Evaluators for open-ended generation tasks. Together, these methods reinforce the value of breaking evaluation into smaller, more structured judgments.
Atomic Decomposition for Evaluation. FActScore (Min et al., 2023) pioneered the “decompose-then-verify” paradigm by breaking long-form generations into atomic facts and verifying them individually. Related frameworks such as ARES (Saad-Falcon et al., 2024) and RAGAS (Es et al., 2024) extend similar decomposition ideas to retrieval-augmented generation, while OpenFActScore (Lage and Ostermann, 2025) enables open-source fact-checking with atomic evaluation. These approaches demonstrate that fine-grained decomposition can improve factual assessment, although they typically decompose generated content rather than evaluation criteria themselves.
Prompt Optimization. Prompt optimization has increasingly shifted from manual instruction engineering toward automated and programmatic refinement. DSPy (Khattab et al., 2023) provides a framework for declarative, self-improving language-model pipelines, and algorithms like MIPRO (Opsahl-Ong et al., 2024) perform Bayesian search over instructions and demonstrations. OPRO (Yang et al., 2023) and APE (Zhou et al., 2023) likewise use language models to iteratively generate and refine prompts. More recent methods such as MARS (Zhang et al., 2025) introduce multi-agent Socratic optimization, while LLM-AutoDiff (Yin and Wang, 2025) treats textual inputs as trainable parameters in graph-structured workflows. These methods motivate our use of disagreement-driven prompt refinement as a targeted optimization signal.
3 Method
We present BinEval in three parts: binary question generation (Section 3.1), binary evaluation and scoring (Section 3.2), and iterative prompt optimization (Sections 3.3 and 3.4).
3.1 Binary Question Generation
Let $T$ denote a task prompt defining the generation requirements, such as a summarization instruction, a dialogue system prompt, or an instruction-following specification. We define a decomposition function that maps $T$ to a set of binary questions:
$$\mathcal{Q}=\mathcal{F}_{\text{LLM}}(T;M)=\{q_{1},q_{2},\dots,q_{N}\},$$
where $M$ is a meta-prompt that instructs an LLM to perform a two-step decomposition.
Step 1: Summarize. We first summarize the task prompt $T$ into an explicit set of requirements $\mathcal{R}=\{r_{1},r_{2},\dots,r_{K}\}$. Each requirement $r_{k}$ captures a distinct evaluation criterion, such as whether the output includes a key piece of information or obeys a formatting constraint. This summarization step is intended to help the model form a coherent representation of the full task before attempting finer-grained decomposition.
Step 2: Decompose. For each requirement $r_{k}$, we generate one or more binary questions such that answering “yes” indicates the output satisfies the requirement and answering “no” indicates a violation. Requirements that implicitly contain multiple sub-tasks are decomposed into separate questions, and each question is paired with a concise violation example to clarify the negative case. This design is motivated by prior work showing that complex reasoning is often improved by decomposing a task into simpler sub-problems that can be solved sequentially or modularly (Zhou et al., 2022; Khot et al., 2022). In our setting, the same intuition suggests that evaluation becomes easier when the model answers targeted binary questions about simplified sub-tasks rather than making a single holistic judgment.
The questions can be organized into evaluation dimensions. For a set of dimensions $\mathcal{D}$, such as coherence, consistency, fluency, and relevance, the questions partition as
$$\mathcal{Q}=\bigcup_{d\in\mathcal{D}}\mathcal{Q}_{d},$$
where $\mathcal{Q}_{d}$ contains questions specific to dimension $d$. The meta-prompt $M$ is task-agnostic: the same meta-prompt generates appropriate binary questions for summarization, dialogue, instruction following, or any other task, with only $T$ changing.
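The two-step decomposition can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `call_llm` callable, the `META_PROMPT` wording, the JSON output schema, and the `BinaryQuestion` fields are all assumptions introduced here.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BinaryQuestion:
    dimension: str          # evaluation dimension d, e.g. "consistency"
    requirement: str        # requirement r_k that this question checks
    text: str               # phrased so that "yes" means the requirement is met
    violation_example: str  # concise example of the negative case


# Hypothetical meta-prompt M; the paper's actual wording is not reproduced here.
META_PROMPT = (
    "Step 1: summarize the task prompt into explicit requirements.\n"
    "Step 2: for each requirement, write atomic yes/no questions, each with "
    "a short violation example.\n"
    "Return a JSON list of objects with keys: dimension, requirement, "
    "text, violation_example.\n\nTask prompt:\n{task_prompt}"
)


def generate_questions(
    task_prompt: str, call_llm: Callable[[str], str]
) -> List[BinaryQuestion]:
    """Q = F_LLM(T; M): only the task prompt T changes between tasks."""
    raw = call_llm(META_PROMPT.format(task_prompt=task_prompt))
    return [BinaryQuestion(**item) for item in json.loads(raw)]


def partition_by_dimension(
    questions: List[BinaryQuestion],
) -> Dict[str, List[BinaryQuestion]]:
    """Group questions into the per-dimension subsets Q_d."""
    groups: Dict[str, List[BinaryQuestion]] = {}
    for q in questions:
        groups.setdefault(q.dimension, []).append(q)
    return groups
```

Keeping the violation example on the question object means it can be shown to the evaluator alongside the question text at scoring time.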
3.2 Binary Evaluation and Scoring
Given an evaluator LLM $E$, an input $x$ such as a source document, a transcript, or an instruction, an output $y$ such as a generated summary, a dialogue response, or a completion, and a binary question $q_{i}$, we define the binary evaluation function
$$f_{E}(x,y,q_{i})\in\{0,1\},$$
where $f_{E}(x,y,q_{i})=1$ if the evaluator answers “yes” and $0$ otherwise. Alongside each binary verdict, the evaluator produces a natural-language explanation $e_{i}$, enabling interpretability.
The per-dimension score for dimension $d$ is
$$S_{d}(x,y)=\frac{1}{|\mathcal{Q}_{d}|}\sum_{q_{i}\in\mathcal{Q}_{d}}f_{E}(x,y,q_{i}).$$
The overall score across all $N$ questions is
$$S(x,y)=\frac{1}{N}\sum_{i=1}^{N}f_{E}(x,y,q_{i}).$$
Both scores lie in $[0,1]$, where 1 indicates all criteria are satisfied. To enable comparison with existing evaluation frameworks that use different scales, the scores can be mapped from $[0,1]$ to any target interval $[a,b]$ via affine scaling:
$$S^{\prime}(x,y)=S(x,y)\cdot(b-a)+a.$$
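The three scoring formulas above amount to simple averaging followed by a linear map. A minimal sketch, assuming verdicts are stored as a dict from a question id to 0 or 1 (the function names and data layout are illustrative, not from the paper):

```python
from collections import defaultdict
from typing import Dict


def dimension_scores(
    verdicts: Dict[str, int], question_dims: Dict[str, str]
) -> Dict[str, float]:
    """S_d: fraction of 'yes' verdicts among the questions of each dimension."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for qid, verdict in verdicts.items():
        dim = question_dims[qid]
        totals[dim] += verdict
        counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in counts}


def overall_score(verdicts: Dict[str, int]) -> float:
    """S: fraction of 'yes' verdicts across all N questions."""
    return sum(verdicts.values()) / len(verdicts)


def rescale(score: float, a: float, b: float) -> float:
    """S' = S * (b - a) + a, mapping [0, 1] onto [a, b]."""
    return score * (b - a) + a


# Example: three consistency questions (one failed) and two fluency questions.
verdicts = {"q1": 1, "q2": 1, "q3": 0, "q4": 1, "q5": 1}
question_dims = {
    "q1": "consistency", "q2": "consistency", "q3": "consistency",
    "q4": "fluency", "q5": "fluency",
}
per_dim = dimension_scores(verdicts, question_dims)  # consistency 2/3, fluency 1
likert = rescale(overall_score(verdicts), 1, 5)      # 4/5 mapped onto a 1-5 scale
```

Note that the overall score weights every question equally, so a dimension with more questions contributes more to $S$ than one with fewer.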
3.3 Cross-Model Prompt Update
BinEval’s binary question framework enables cross-model prompt update between evaluators. The key insight is that disagreements between a source evaluator and a target evaluator on specific binary questions provide a fine-grained signal for improvement: unlike holistic score differences, binary question disagreements identify exactly which criteria are being judged inconsistently across models. This makes it possible to use a stronger source model as a reference and iteratively update the prompt of a different, typically weaker, target model until its evaluator behavior matches the source more closely. The same procedure is also useful when migrating to a different model family, where the prompt must be updated to maintain similar performance.
Let $E_{\text{src}}$ denote a source evaluator, treated as the reference model, and let $E_{\text{tgt}}$ denote a target evaluator whose prompt $P_{E}$ we wish to improve. Let $P_{E}^{(t)}$ denote the target evaluator’s prompt at iteration $t$.
At each iteration $t$, the optimization proceeds in five steps:
1. Evaluate. For each test case $(x_{j},y_{j})$, obtain binary evaluations from both models:
$$A_{j}^{\text{src}}=\{f_{E_{\text{src}}}(x_{j},y_{j},q_{i})\}_{i=1}^{N},\qquad A_{j}^{\text{tgt}}=\{f_{E_{\text{tgt}}}(x_{j},y_{j},q_{i};P_{E}^{(t-1)})\}_{i=1}^{N}.$$
2. Identify disagreements. Compute the set of questions on which the evaluators disagree:
$$\Delta_{j}=\{q_{i}\in\mathcal{Q}:A_{j}^{\text{src}}(q_{i})\neq A_{j}^{\text{tgt}}(q_{i})\}.$$
3. Extract lessons. A note-taker LLM $L_{\text{note}}$ analyzes each disagreement in context, extracting generalized lessons:
$$\mathcal{L}_{j}=L_{\text{note}}(x_{j},y_{j},A_{j}^{\text{src}},A_{j}^{\text{tgt}},\Delta_{j}).$$
4. Deduplicate. Each new lesson $\ell_{\text{new}}$ is compared against the lessons $\ell_{k}$ already collected in $\mathcal{M}$; it is merged into a similar existing lesson or added as a new one:
$$\text{Dedup}(\ell_{\text{new}},\mathcal{M})=\begin{cases}\text{merge}(\ell_{\text{new}},\ell_{k}),&\text{if }\ell_{\text{new}}\sim\ell_{k}\\ \text{add}(\ell_{\text{new}}),&\text{otherwise.}\end{cases}$$
The final set of unique lessons is $\mathcal{L}_{\text{unique}}=\text{Dedup}(\bigcup_{j}\mathcal{L}_{j})$.
5. Update prompt. For each unique lesson $\ell_{k}\in\mathcal{L}_{\text{unique}}$, an updater LLM identifies the relevant substring $s_{k}$ in the current prompt and produces a revised substring $s^{\prime}_{k}$ that incorporates the lesson:
$$P_{E}^{(t)}\leftarrow P_{E}^{(t)}.\text{replace}(s_{k},s^{\prime}_{k}).$$
The loop terminates when the target evaluator’s scores match the source evaluator’s scores within a tolerance $\epsilon$ across all dimensions:
$$|S_{d}^{\text{tgt},(t)}-S_{d}^{\text{src}}| < \epsilon\qquad\forall d\in\mathcal{D},$$
or equivalently, when the target evaluator meets or exceeds the source evaluator on all dimensions. The full algorithm is shown in Appendix 1.
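The control flow of the cross-model loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the evaluators, note-taker, deduplicator, and updater are passed in as plain callables (`eval_src`, `eval_tgt`, `take_notes`, `dedup`, `apply_lesson` are hypothetical names), and the defaults `eps=0.05` and `max_iters=5` are arbitrary placeholders.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Verdicts = Dict[str, int]  # question id -> 0 or 1


def disagreements(src: Verdicts, tgt: Verdicts) -> List[str]:
    """Delta_j: questions on which source and target verdicts differ."""
    return [q for q in src if src[q] != tgt[q]]


def mean_dim_scores(
    all_verdicts: Sequence[Verdicts], question_dims: Dict[str, str]
) -> Dict[str, float]:
    """Per-dimension score S_d averaged over all test cases."""
    sums: Dict[str, int] = {}
    counts: Dict[str, int] = {}
    for verdicts in all_verdicts:
        for q, v in verdicts.items():
            d = question_dims[q]
            sums[d] = sums.get(d, 0) + v
            counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] for d in sums}


def converged(src: Dict[str, float], tgt: Dict[str, float], eps: float) -> bool:
    """|S_d^tgt - S_d^src| < eps for every dimension d."""
    return all(abs(tgt[d] - src[d]) < eps for d in src)


def cross_model_update(
    cases: Sequence[Tuple[str, str]],
    questions: Sequence[str],
    question_dims: Dict[str, str],
    eval_src: Callable[[str, str, str], int],
    eval_tgt: Callable[[str, str, str, str], int],
    take_notes: Callable[..., List[str]],
    dedup: Callable[[List[str]], List[str]],
    apply_lesson: Callable[[str, str], str],
    prompt: str,
    eps: float = 0.05,   # assumed tolerance
    max_iters: int = 5,  # assumed iteration cap
) -> str:
    # The source verdicts do not depend on the target prompt: compute them once.
    src = [{q: eval_src(x, y, q) for q in questions} for x, y in cases]
    src_scores = mean_dim_scores(src, question_dims)
    for _ in range(max_iters):
        tgt = [{q: eval_tgt(x, y, q, prompt) for q in questions} for x, y in cases]
        if converged(src_scores, mean_dim_scores(tgt, question_dims), eps):
            break
        lessons: List[str] = []
        for (x, y), a_src, a_tgt in zip(cases, src, tgt):
            delta = disagreements(a_src, a_tgt)
            if delta:
                lessons.extend(take_notes(x, y, a_src, a_tgt, delta))
        for lesson in dedup(lessons):
            prompt = apply_lesson(prompt, lesson)
    return prompt
```

In practice each callable would wrap an LLM call; passing them as parameters keeps the loop itself testable with stubs.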
3.4 Self Prompt Update
The same binary question framework can also be used for self prompt update in generation. Instead of aligning one evaluator to another model, this procedure iteratively improves a generator by using evaluator-identified failures as feedback on its own outputs. Given a generation LLM $L_{G}$ with prompt $P_{G}^{(t)}$ at iteration $t$:
1. Generate. Produce outputs using the current prompt: $y_{j}^{(t)}=L_{G}(x_{j};P_{G}^{(t)})$.
2. Evaluate. Score each output using the potentially already-improved evaluator and collect failing questions:
$$\mathcal{E}_{j}=\{(q_{i},e_{i}):f_{E}(x_{j},y_{j}^{(t)},q_{i})=0\},$$
where $e_{i}$ is the evaluator’s explanation for the failure.
3. Extract lessons. A note-taker LLM analyzes the evaluation errors in context: $\mathcal{L}_{j}=L_{\text{note}}(x_{j},y_{j}^{(t)},\mathcal{E}_{j})$.
4. Deduplicate and update. Apply the same semantic deduplication and prompt rewriting procedure used for evaluator optimization, but now to $P_{G}$.
The generation loop terminates when no evaluation errors remain or when the maximum number of iterations is reached.
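The generation-side loop has the same shape, with failing questions replacing disagreements. A minimal sketch, assuming the generator, evaluator, note-taker, deduplicator, and updater are supplied as callables (`generate`, `evaluate`, `take_notes`, `dedup`, and `apply_lesson` are hypothetical names, and `max_iters=3` is an arbitrary default):

```python
from typing import Callable, List, Sequence, Tuple


def self_update(
    inputs: Sequence[str],
    questions: Sequence[str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str, str], Tuple[int, str]],
    take_notes: Callable[[str, str, List[Tuple[str, str]]], List[str]],
    dedup: Callable[[List[str]], List[str]],
    apply_lesson: Callable[[str, str], str],
    prompt: str,
    max_iters: int = 3,  # assumed iteration cap
) -> str:
    for _ in range(max_iters):
        lessons: List[str] = []
        any_failure = False
        for x in inputs:
            y = generate(x, prompt)
            # E_j: failing questions paired with the evaluator's explanation.
            failures: List[Tuple[str, str]] = []
            for q in questions:
                verdict, explanation = evaluate(x, y, q)
                if verdict == 0:
                    failures.append((q, explanation))
            if failures:
                any_failure = True
                lessons.extend(take_notes(x, y, failures))
        if not any_failure:
            break  # no evaluation errors remain
        for lesson in dedup(lessons):
            prompt = apply_lesson(prompt, lesson)
    return prompt
```

Because the loop always stops at `max_iters`, it cannot detect the prompt degradation that later iterations may cause; selecting the best iteration on held-out data would be a separate step.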
4 Experimental Setup
We design two complementary sets of experiments. Part I evaluates BinEval’s performance on established benchmarks with human annotations. Part II demonstrates the iterative prompt-updating mechanism on both an unverifiable task and a verifiable task. Across these experiments, we use gpt-oss-120b and Claude Sonnet 4. To reduce randomness in LLM responses, we set the temperature to 0 in all experiments and report the average over two runs.
4.1 Metrics
For evaluation quality, we report Spearman’s rank correlation ($\rho$), Kendall’s rank correlation ($\tau$), and Pearson correlation ($r$) between method scores and human judgments at the summary level.
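For reference, the three correlation metrics can be computed without external libraries. The sketch below is illustrative only: the paper does not state which Kendall variant or tie handling it uses, so the tau-b form and average ranks for ties are assumptions made here.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson r; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_b(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-b, which corrects for ties (assumed variant)."""
    conc = disc = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue          # tied in both: counted in neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + ties_x) * (conc + disc + ties_y))
    return (conc - disc) / denom


# Example: human Likert ratings vs. rescaled evaluator scores for four summaries.
human = [1, 2, 2, 3]
scores = [0.2, 0.5, 0.5, 0.9]
rho, tau = spearman(human, scores), kendall_tau_b(human, scores)
```

Ties matter here because averaged binary verdicts take only a handful of distinct values, so many outputs share the same score.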
4.2 Part I: Evaluation Quality Validation
We follow the evaluation protocol of UniEval (Zhong et al., 2022) and evaluate on three established benchmarks.
SummEval (Fabbri et al., 2021). A benchmark of 100 CNN/DM (See et al., 2017) source articles, each summarized by 16 different summarization models, yielding 1,600 summary-level annotations. Human evaluators rated each summary on four dimensions: fluency, coherence, consistency, and relevance. Ratings are on a 1–5 Likert scale.
Topical-Chat (Mehri and Eskenazi, 2020). A benchmark of 60 dialogue responses generated by 6 dialogue models, annotated on six dimensions: naturalness, coherence, engagingness, groundedness, understandability, and an overall quality rating. Following Zhong et al. (2022), we use four of these aspects.
QAGS (Wang et al., 2020). A benchmark specifically targeting hallucination evaluation in summarization, comprising 235 samples from CNN/DM and 239 from XSum (Narayan et al., 2018). Annotators rated the consistency of each summary with respect to its source document.
4.3 Part II: Iterative Prompt Updating
We evaluate BinEval’s iterative prompt update mechanism (Algorithm 1) on two tasks: evaluator prompt optimization on SummEval, which is unverifiable in the sense that there is no programmatic gold checker, and generation prompt optimization on IFBench (Pyatkin et al., 2025), which is verifiable via executable constraint checkers. For SummEval, we test two update modes: self-update, where a single model (gpt-oss-120b) improves its own evaluator prompt using failures against human judgments, and cross-model update, where a stronger model (Claude Sonnet 4) serves as the reference evaluator and lessons from disagreements are used to update the target model’s prompt. See Appendix B for detailed experimental setups.
5 Results
5.1 Evaluation Quality: SummEval
Table 1: Summary-level Spearman $\rho$ / Kendall $\tau$ correlations on SummEval.
Table 1 shows a clear ranking across evaluation paradigms. BinEval (Claude) is the strongest method overall, achieving the best average Spearman and Kendall correlations and leading on coherence, consistency, and fluency. The largest gain is on consistency, where BinEval reaches 0.655 / 0.615, suggesting that decomposing factual quality into multiple targeted checks is especially effective for summary evaluation. Relevance remains the main exception: G-Eval (GPT-4) is best on that dimension, indicating that some broader semantic judgments are still harder to capture with binary decomposition.
The additional gpt-oss runs clarify why decomposition matters. Under the same backbone, BinEval (gpt-oss) outperforms both G-Eval (gpt-oss) and UniEval (gpt-oss) on average, driven by large gains on coherence and consistency. G-Eval with gpt-oss remains viable on numeric-scale dimensions such as consistency and relevance, but its fluency performance collapses. UniEval with gpt-oss is weaker still, with near-zero fluency correlation, showing that a single yes/no question is often too coarse for a general-purpose model. Overall, SummEval supports the core claim of the paper: multiple binary questions provide a more robust and transferable evaluation signal than either a single holistic score or a single Boolean judgment.
Figure 1: Per-dimension score distributions on SummEval. BinEval shows its strongest correlation on consistency. Its distribution is closest to the human shape while still preserving useful spread; it also remains competitive on coherence and fluency, even when its calibration is slightly more conservative than human ratings.
Figure 2: Per-system average-score distributions on SummEval. Across the 16 summarization systems, BinEval (Claude) best tracks the relative ordering of systems, while the weaker baselines produce flatter and less discriminative score patterns.
Figure 1 gives a more nuanced view of these gains. The figure presents violin plots of score distributions on SummEval across four evaluation dimensions, comparing human annotations with different methods. BinEval is visually closest to the human distributions on consistency, where it largely matches the human concentration near the upper end while still retaining some low-scoring mass; this mirrors its largest correlation advantage in Table 1. Across dimensions, BinEval (Claude) is generally among the methods most closely aligned with human judgments in central tendency and spread, with its strongest match on consistency. UniEval and G-Eval exhibit narrower, more concentrated distributions, suggesting weaker discrimination across systems. The gpt-oss-based variants consistently underestimate scores relative to humans, especially on coherence and relevance, where BinEval (gpt-oss) and G-Eval (gpt-oss) show visibly lower means. Fluency is tightly clustered near the ceiling for all methods, reflecting the generally high fluency of modern summarization systems and the limited variance of this dimension. Notably, UniEval (gpt-oss) yields a near-degenerate fluency distribution, indicating its inability to differentiate quality along this axis. Overall, BinEval’s main strength is not perfect calibration on every dimension, but its ability to preserve meaningful relative variation, especially for factual consistency.
Figure 2 provides the same comparison at the system level, where each score is averaged across the four SummEval dimensions and the 16 systems are ordered by ascending human mean. BinEval (Claude) tracks the human ranking most faithfully, preserving the monotonic trend from weaker to stronger systems while maintaining visible separation among mid- and low-performing models. By contrast, UniEval and G-Eval exhibit more compressed score ranges that attenuate differences between systems, especially in the middle of the ranking. The gpt-oss-based methods are generally more conservative in absolute score level, but they still recover much of the broad system ordering. Another clear pattern is distributional width: BinEval variants tend to show wider, more human-like within-system variance, whereas UniEval and G-Eval produce tighter violins that may understate genuine score variability. Agreement across methods is strongest for the highest human quality systems (rightmost), while the lower-quality systems show larger divergence, suggesting that distinguishing poor from mediocre summaries remains a challenge for automated evaluation methods.
5.2 Evaluation Quality: Topical-Chat
The dialogue results show that BinEval transfers effectively beyond summarization. BinEval (Claude) achieves the best average Spearman correlation on Topical-Chat (0.632), with especially strong gains on naturalness and engagingness, while BinEval (gpt-oss) remains competitive with G-Eval (gpt-oss) and substantially stronger than UniEval (gpt-oss). These results suggest that decomposing dialogue quality into multiple concrete questions is particularly helpful for subjective conversational criteria. Detailed results are provided in Appendix D.1.
5.3 Evaluation Quality: QAGS
QAGS highlights the advantage of decomposition most clearly. BinEval (Claude) achieves the best average Spearman correlation (0.620), and even BinEval (gpt-oss) substantially outperforms G-Eval (gpt-oss), whose binary prompt produces too little score granularity for reliable ranking. This suggests that decomposing factual consistency into several targeted questions is much more robust than relying on a single holistic or yes/no judgment, especially on hallucination-prone data such as XSum. Detailed results and discussion are provided in Appendix D.2.
5.4 Iterative Prompt Update
5.4.1 SummEval: Evaluator Prompt Update
Table 2 reports test-set Spearman $\rho$ under iterative prompt update on the four SummEval dimensions. Both update modes improve three of the four dimensions. Self-update yields the largest single-dimension gain on fluency (+0.119), where the baseline prompt is especially weak and iterative refinement of both the evaluator rubric and the generated binary questions substantially improves alignment with human judgments. Cross-model update is strongest on consistency (+0.136), which is consistent with the idea that a stronger reference evaluator provides especially useful guidance for factual verification. Averaged across dimensions, self-update improves by +0.075, while cross-model update improves by +0.070.
Relevance resists improvement under both update modes. Inspecting the updated prompts suggests that lesson-driven refinements tend to over-decompose relevance into overly granular requirements, such as separate checks for every actor, motivation, and background event. These refinements make the evaluator more severe than human annotators rather than better aligned with them, which suggests that relevance remains a comparatively holistic judgment and is less amenable to fine-grained binary decomposition than dimensions with more concrete failure modes.
Three observations stand out. First, the two update modes are complementary: self-update helps most on coherence and fluency, while cross-model update helps most on consistency, indicating that human-score divergence and inter-model disagreement surface different classes of evaluator error. Second, most gains appear within the first one or two iterations; later iterations are more likely to degrade the prompt as lessons accumulate into competing instructions. Third, binary question regeneration is critical: the largest gains occur in iterations that alter not only the evaluator prompt but also the induced question decomposition, reinforcing that question design is itself a key lever for evaluation quality.
Table 2: Evaluator prompt update on SummEval. Test-set Spearman $\rho$ with human judgments; $\Delta$ is the absolute improvement over the baseline. The best iteration is selected by early stopping on test performance.
5.4.2 IFBench: Generation Prompt Update
Table 3 presents strict test-set accuracy on IFBench across prompt-update iterations. Self-update achieves a modest improvement, peaking at 38.0% at iteration 3, which is a gain of +3.4 percentage points over its own iteration-0 baseline. However, the same run collapses by iteration 4, illustrating the fragility of repeated prompt rewriting. Cross-model update shows no improvement and in fact declines after the first update step, suggesting that the stronger judge’s stricter standard can overcorrect the prompt rather than refine it.
The per-category breakdown in Table 4 reveals a sharp divide between promptable and computational constraints. Format and sentence constraints improve substantially, each by 17 percentage points, indicating that these tasks are often solved once the model is given clearer structural guidance. By contrast, count, ratio, words, and repeat constraints show little or no improvement. These constraints require precise computation during generation, such as maintaining counts, enforcing ratios, or filtering words by syllabic or lexical criteria. The extracted lessons often diagnose these failures correctly, but instructions such as “maintain an internal counter” do not endow the model with new computational ability. Instead, they accumulate into prompt bloat, which eventually harms even categories that were previously working well.
The main takeaway is that iterative prompt update is effective when the model already has the relevant capability but needs better guidance to express it. It is much less effective when failures reflect an underlying capability limitation rather than a prompting problem. In these cases, BinEval still provides accurate diagnoses, but the resulting fixes are largely unactionable and can degrade performance through instruction overload.
5.5 Case Study
Appendix A presents both evaluation and prompt-update examples. It includes four SummEval case studies, one per dimension, showing that BinEval can recognize coherence in a one-sentence summary, identify subtle factual errors, assign partial credit to garbled text, and separate incompleteness from irrelevance. The appendix also includes SummEval prompt-update examples for self-update and cross-model update, a relevance failure case where over-decomposition hurts alignment with human judgments, and an IFBench example highlighting the boundary between promptable failures and underlying computational limits. Together, these examples show that decomposition yields more justifiable scores and helps diagnose when prompt refinement succeeds or fails.
Table 3: Generation prompt update on IFBench (test-set strict accuracy, %).
Table 4: IFBench per-category accuracy (%) under self-update.
5.6 Why Does Decomposition Work?
Why does evaluating through multiple atomic binary questions outperform a single holistic judgment? We identify three contributing mechanisms and examine the evidence for each on SummEval (see Appendix E for the full question sets).
Complexity Reduction. Each binary question isolates a single verifiable property, replacing one multi-faceted judgment with many simpler ones, mirroring the benefits of task decomposition in prompting (Zhou et al., 2022; Khot et al., 2022). A question like “Are all named entities accurately represented?” is easier to answer reliably than “Rate factual consistency from 1–5.” On consistency, the seven targeted questions yield yes-rates spread between 0.75 and 0.95 (Table 10), indicating each captures a distinct difficulty level. This pattern holds across dimensions: fluency, relevance, and coherence show yes-rate spreads of 0.48, 0.46, and 0.86 respectively.
Variance Reduction via Aggregation. Aggregating $N$ weakly correlated binary classifiers reduces variance roughly in proportion to $1/N$. Figure 3 shows this mechanism varies by dimension: relevance and coherence have the lowest mean inter-question correlations ($\phi=0.20$ and $0.28$; 80% and 64% of pairs with $|\phi| < 0.3$), while fluency is moderate ($\phi=0.39$; e.g., spelling Q2 vs. punctuation Q3 at $\phi=0.02$). Consistency is the exception ($\phi=0.58$, zero weak pairs), where questions like “free of factual errors” and “no misrepresentation” are inherently related ($\phi=0.79$).
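To see how inter-question correlation limits the $1/N$ effect, consider the textbook variance of an average of $N$ variables with equal variance $\sigma^2$ and a common pairwise correlation $\rho$: $\sigma^2\,(1/N + (N-1)\rho/N)$. The sketch below evaluates that factor; treating the reported mean $\phi$ as a common $\rho$ is a simplifying assumption made here for illustration, not an analysis from the paper.

```python
def variance_factor(n: int, rho: float) -> float:
    """Variance of the mean of n equicorrelated variables, in units of sigma^2.

    Var(mean) / sigma^2 = 1/n + (n - 1) * rho / n.
    With rho = 0 this is exactly 1/n; with rho = 1 averaging does not help.
    """
    return 1 / n + (n - 1) * rho / n


# Seven consistency questions with the reported mean phi of 0.58,
# compared with seven fully uncorrelated questions.
correlated = variance_factor(7, 0.58)    # about 0.64 of a single question's variance
uncorrelated = variance_factor(7, 0.0)   # 1/7, about 0.14
```

Under this simplification, the seven consistency questions remove only about a third of the single-question variance, which fits the text's observation that consistency benefits least from aggregation.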
Coverage of Failure Modes. Decomposition forces explicit enumeration of criteria, improving recall over holistic judgments. In fluency, spelling (Q2) and punctuation (Q3) are nearly uncorrelated ($\phi=0.02$) with different yes-rates (0.71 vs. 0.33), catching disjoint failures. Relevance Q1 (main topic, 0.95) and Q3 (redundancy, 0.64) show $\phi=0.01$. Consistency again is weakest: its least correlated pair has $\phi=0.32$.
Figure 3: Pairwise phi-coefficient correlation matrices within each SummEval dimension. Low off-diagonal values indicate questions capture distinct aspects of the dimension. Mean off-diagonal $\phi$ across all dimensions is $0.38$. See Appendix E for question definitions.

Dimension-Level Summary. The three mechanisms contribute unequally. Relevance and coherence exhibit strong variance reduction and coverage. Fluency benefits from all three. Consistency is the most instructive: weakest variance reduction and coverage, yet the largest gain over UniEval (+0.195 Spearman $\rho$), suggesting that complexity reduction alone (decomposing factual verification into targeted sub-checks) can be the dominant driver. From a practical standpoint, practitioners can inspect generated questions for these properties (yes-rate spread, inter-question correlation, pairwise coverage) to anticipate where decomposition will help most and where refinement is needed.
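A minimal sketch of how such an inspection might be run on a matrix of binary verdicts follows. The toy verdicts, the helper names, and the choice to return 0.0 when the phi coefficient is undefined are all assumptions made for illustration; the $|\phi|<0.3$ threshold for a "weak" pair is the one used above.

```python
from itertools import combinations
from math import sqrt


def yes_rate(verdicts: list[int]) -> float:
    """Fraction of outputs for which a question was answered 'yes' (1)."""
    return sum(verdicts) / len(verdicts)


def phi(a: list[int], b: list[int]) -> float:
    """Phi coefficient between two equal-length binary verdict vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Phi is undefined when either question is constant; returning 0.0 is
    # a convention chosen for this sketch.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Toy data: each key is a question, each list entry is one evaluated output.
verdicts = {
    "Q1": [1, 1, 1, 0, 1, 1],
    "Q2": [1, 0, 1, 0, 1, 0],
    "Q3": [0, 1, 1, 0, 0, 1],
}

rates = {q: yes_rate(v) for q, v in verdicts.items()}
spread = max(rates.values()) - min(rates.values())
pairwise = {(p, q): phi(verdicts[p], verdicts[q])
            for p, q in combinations(verdicts, 2)}
mean_phi = sum(pairwise.values()) / len(pairwise)
weak_share = sum(abs(v) < 0.3 for v in pairwise.values()) / len(pairwise)

print(f"yes-rate spread: {spread:.2f}")
print(f"mean off-diagonal phi: {mean_phi:.2f}, weak pairs: {weak_share:.0%}")
```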
6 Discussion
Failure Modes. Decomposition works best for concrete criteria such as factual consistency, where errors can be tied to specific claims or entities and can therefore be checked with relatively clear yes/no decisions. It is less reliable for subjective qualities, where human judgments are more holistic and less reducible to a set of binary checks. In such cases, the quality of the evaluation depends heavily on whether the generated questions capture the aspects that humans actually weigh when forming an overall judgment. The appendix shows both patterns: relevance can degrade when decomposition becomes too strict, and prompt update helps less when failures reflect the model’s base capability rather than its instructions. On IFBench, clearer prompts help with format and sentence-level constraints but not with counting or ratio tracking, suggesting that some errors stem from execution limits rather than task specification alone.
Computational Cost. BinEval trades efficiency for diagnostic value. Compared with a single holistic judgment, it must generate binary questions and answer each of them. This increases both the number of model calls and the total amount of text processed during evaluation. Prompt updating adds note-taking, lesson deduplication, and meta-prompt rewriting, though batching keeps the first two modest and prompt rewriting is shared by most update methods. The main recurring cost is question-level evaluation.
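As a rough back-of-the-envelope sketch of the recurring cost, the snippet below counts judge calls under two assumptions that are ours rather than the paper's: a holistic judge issues one call per (output, dimension) pair, and a decomposed judge issues one call per (output, question) pair with no batching. The per-dimension question counts are hypothetical placeholders, and one-time question generation is excluded.

```python
def judge_calls(num_outputs: int, questions_per_dimension: dict[str, int]) -> dict[str, int]:
    """Count judge calls for holistic vs. decomposed evaluation.

    Assumes one call per (output, dimension) for the holistic judge and
    one call per (output, question) for the decomposed judge.
    """
    holistic = num_outputs * len(questions_per_dimension)
    decomposed = num_outputs * sum(questions_per_dimension.values())
    return {"holistic": holistic, "decomposed": decomposed}


# Hypothetical question counts per dimension; 1,600 is the SummEval size.
counts = {"coherence": 8, "consistency": 7, "fluency": 8, "relevance": 6}
calls = judge_calls(1600, counts)
print(calls, f"ratio = {calls['decomposed'] / calls['holistic']:.2f}x")
```

An implementation that answers several questions in one call would change these counts, so the ratio is best read as an upper bound under the stated assumptions.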
Limitations. The method still depends on question quality: if important criteria are missing, the final score will miss them. It also assumes that the fraction of satisfied questions maps approximately linearly to overall quality, which need not always hold.
6.1 Decomposed Evaluation vs. Holistic Scoring
Consistency Example: Why question decomposition outperforms holistic scoring

Task: Evaluate the factual consistency of a summary against its source article.

Source article (excerpt): “The U.S. military said a Russian SU-27 Flanker conducted a barrel roll over a U.S. RC-135U … The Pentagon called the intercept unsafe and unprofessional … The Russian Defense Ministry said the jet was scrambled to identify the aircraft …”

System summary: “The U.S. RC-135U was flying over the Baltic Sea when it was intercepted by a Russian SU-27 Flanker. The Pentagon said the Russian jet flew around the U.S. plane several times to identify it and get its tail number. Read more: http://dailycaller.com/2017/10/29/…”

BinEval (Claude), decomposed evaluation:
- × Q1. Claims supported? N: Pentagon’s statement is misattributed (Russia said “identify,” not the Pentagon).
- × Q2. No fabrication? N: URL dailycaller.com/… does not appear in the source.
- × Q3. Entities accurate? N: Misattributes Russia’s stated purpose to the Pentagon.
- ✓ Q4. Numbers correct? Y: Aircraft types (RC-135U, SU-27) match the source.
- × Q5. Causal relations preserved? N: Conflates Pentagon and Russian accounts of the intercept.
- ✓ Q6–7. No hallucinations? No misrepresentation of scope? Y: Core event is described.

Score: $3/7$ questions answered Y $\;\Rightarrow\; 3/7 \approx 0.43 \;\xrightarrow{\text{scale to 1--5}}\; \mathbf{1.57}$ (human: 2.0, error: 0.43)

Key insight. G-Eval and UniEval assign perfect consistency (5.0) because the summary looks plausible at a surface level: it mentions the correct aircraft and event. BinEval’s decomposed questions probe each factual claim independently, catching the misattribution (Q1, Q3), fabricated URL (Q2), and conflated accounts (Q5). The resulting $3/7$ score closely matches the human rating of 2.0, while holistic methods miss every error.

Figure 4: Illustrative SummEval consistency example. The summary contains subtle factual errors (underlined) that holistic scoring methods miss. BinEval decomposes consistency into seven binary questions, each targeting a specific error type, producing a score closely aligned with the human judgment.

Figure 4 illustrates a representative failure mode of holistic evaluation methods. The summary under evaluation contains three distinct factual errors (underlined): a misattribution of Russia’s stated purpose to the Pentagon, a fabricated external URL absent from the source, and a conflation of the two parties’ accounts of the intercept. Despite these errors, both G-Eval and UniEval assign a perfect consistency score of 5.0, because the summary is surface-plausible: it names the correct aircraft types and describes the general event accurately. Holistic scoring conflates local correctness with global consistency, rewarding fluent, topically coherent text even when specific claims are wrong.
BinEval avoids this by decomposing consistency into seven targeted binary questions, each probing a distinct claim type: factual support, fabrication, entity accuracy, numerical correctness, causal fidelity, hallucination, and scope representation. Questions Q1, Q3, and Q5 directly surface the misattribution and conflation; Q2 flags the fabricated URL. The resulting score of 1.57 ($3/7$ questions satisfied, scaled to 1–5) closely matches the human rating of 2.0 ($|\Delta|=0.43$), whereas G-Eval and UniEval diverge by 3.0 points. This example motivates the core design principle of BinEval: fine-grained binary questions act as claim-level probes, making errors visible that aggregate scoring systematically obscures. Critically, this granularity also makes the feedback actionable, as each failed question directly identifies the error type, enabling targeted corrections to either the summarizer or the evaluator prompt.
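The diagnostic side of this example can be sketched in a few lines. The `Verdict` record and `summarize` helper are hypothetical names introduced for illustration; the verdicts and explanations are transcribed from the Figure 4 example. The sketch stops at the fraction of satisfied questions and does not reproduce the paper's mapping onto the 1–5 scale.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    question: str
    passed: bool
    explanation: str


def summarize(verdicts: list[Verdict]) -> tuple[float, list[Verdict]]:
    """Return the fraction of satisfied questions and the failed verdicts."""
    failed = [v for v in verdicts if not v.passed]
    fraction = (len(verdicts) - len(failed)) / len(verdicts)
    return fraction, failed


# Verdicts transcribed from the consistency example in Figure 4.
figure4 = [
    Verdict("Q1 claims supported", False, "Pentagon statement is misattributed"),
    Verdict("Q2 no fabrication", False, "URL does not appear in the source"),
    Verdict("Q3 entities accurate", False, "Russia's purpose attributed to the Pentagon"),
    Verdict("Q4 numbers correct", True, "aircraft types match the source"),
    Verdict("Q5 causal relations preserved", False, "Pentagon and Russian accounts conflated"),
    Verdict("Q6 no hallucinations", True, "core event is described"),
    Verdict("Q7 no misrepresentation of scope", True, "core event is described"),
]

fraction, failed = summarize(figure4)
print(f"satisfied fraction: {fraction:.2f}")
for v in failed:
    # Each failed question names the error type that needs fixing.
    print(f"FAILED {v.question}: {v.explanation}")
```

Keeping the explanation next to each verdict is what lets the same record serve both as a score component and as a lesson for prompt updating.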
7 Conclusion
We presented BinEval, a task-agnostic, training-free framework that evaluates LLM outputs by decomposing criteria into atomic binary questions. Across SummEval, Topical-Chat, and QAGS, it matches or outperforms strong evaluators while also supporting iterative prompt optimization on summarization and IFBench. Because each score is grounded in individual verdicts with explanations, BinEval offers interpretable feedback that helps practitioners diagnose and improve LLM systems, and suggests atomic binary decomposition as a promising direction for broader evaluation tasks. These results indicate that interpretability and strong evaluation performance need not come at the expense of scalability or flexibility. Looking ahead, we see natural extensions to agentic and multi-turn settings, where fine-grained, claim-level feedback is especially valuable for identifying where and why a system goes wrong.
Impact Statement
This paper presents work whose goal is to advance the field of machine learning through more interpretable and scalable evaluation of language model outputs. There are many potential societal consequences of our work, including the possibility of improving the reliability of automated evaluation pipelines used in research and deployment. At the same time, evaluator models can inherit the biases and blind spots of the underlying language models used to instantiate them, so any deployment of BinEval should be paired with human oversight in high-stakes settings.
References
- F. Alam, G. Bhatia, S. R. Laskar, and S. A. Chowdhury (2026) Beyond LLM-as-a-judge: deterministic metrics for multilingual generative text evaluation. arXiv preprint arXiv:2604.05083. Cited by: §2.
- S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Cited by: §2.
- S. Es, J. James, L. Espinosa-Anke, and S. Schockaert (2024) RAGAS: automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. Cited by: §2.
- A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev (2021) SummEval: re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9, pp. 391–409. Cited by: §1, §4.2.
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, et al. (2023) DSPy: compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714. Cited by: §2.
- T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal (2022) Decomposed prompting: a modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406. Cited by: §3.1, §5.6.
- S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, and M. Seo (2024) Prometheus 2: an open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4334–4353. Cited by: §2.
- L. Lage and S. Ostermann (2025) OpenFActScore: open-source atomic evaluation of factual precision in long-form text generation. arXiv preprint arXiv:2502.09676. Cited by: §2.
- Q. Lemesle, J. Chevelu, P. Martin, D. Lolive, A. Delhay, and N. Barbot (2025) Paraphrase generation evaluation powered by an LLM: a semantic metric, not a lexical one. In Proceedings of the 31st International Conference on Computational Linguistics. Cited by: §2.
- X. L. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto (2023) AlpacaEval: an automatic evaluator of instruction-following models. GitHub repository. Cited by: §2.
- C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81. Cited by: §1, §2.
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023) G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Cited by: §1, §2.
- S. Mehri and M. Eskenazi (2020) USR: an unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Cited by: §1, §4.2.
- S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023) FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Cited by: §2.
- S. Narayan, S. B. Cohen, and M. Lapata (2018) Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), Brussels, Belgium, pp. 1797–1807. Cited by: §4.2.
- K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024) Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 9340–9366. Cited by: §2.
- K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Cited by: §1, §2.
- V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi (2025) Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833. Cited by: §4.3.
- J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia (2024) ARES: an automated evaluation framework for retrieval-augmented generation systems. arXiv preprint arXiv:2311.09476. Cited by: §2.
- A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §4.2.
- A. Wang, K. Cho, and M. Lewis (2020) Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Cited by: §1, §4.2.
- C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2023) Large language models as optimizers. arXiv preprint arXiv:2309.03409. Cited by: §2.
- L. Yin and Z. Wang (2025) LLM-AutoDiff: auto-differentiate any LLM workflow. arXiv preprint arXiv:2501.16673. Cited by: §2.
- W. Yuan, G. Neubig, and P. Liu (2021) BARTScore: evaluating generated text as text generation. In Advances in Neural Information Processing Systems. Cited by: §2.
- T. Yue, R. Mao, X. Shi, S. Zhan, Z. Yang, and D. Zhao (2025) QAEval: mixture of evaluators for question-answering task evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14717–14730. Cited by: §2.
- J. Zhang, Z. Wang, H. Zhu, J. Liu, Q. Lin, and E. Cambria (2025) MARS: a multi-agent framework incorporating socratic guidance for automated prompt optimization. arXiv preprint arXiv:2503.16874. Cited by: §2.
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. In Proceedings of the 8th International Conference on Learning Representations. Cited by: §1, §2.
- W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger (2019) MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Cited by: §2.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, et al. (2023) Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems. Cited by: §1, §2.
- M. Zhong, Y. Liu, D. Yin, Y. Zhu, C. Zhu, and M. Zeng (2022) Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Cited by: §2, §4.2.
- D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. H. Chi (2022) Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625. Cited by: §3.1, §5.6.
- H. Zhou, H. Huang, R. Zhang, K. Chen, B. Xu, C. Zhu, T. Zhao, and M. Yang (2026) Toward robust LLM-based judges: taxonomic bias evaluation and debiasing optimization. arXiv preprint arXiv:2603.08091. Cited by: §2.
- Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba (2023) Large language models are human-level prompt engineers. In Proceedings of the 11th International Conference on Learning Representations. Cited by: §2.
Appendix A Case Study
A.1 Effective Evaluation: Illustrative Examples
(a) Coherence: Single-sentence summary
Summary: “Speed camera has been turned round and is pointing at this house in Birmingham, West Midlands.”
Decomposed reasoning: Y Q1 (structured); Y Q2 (logical order); Y Q3 (transitions); Y Q4 (no repetition); Y Q5 (unified focus); Y Q6 (main topic); N Q7 (misses some details); Y Q8 (no contradictions).
Insight: A single sentence trivially satisfies ordering, non-contradiction, and focus criteria. The one N (incomplete coverage) yields a proportional penalty: $7/8 \to 4.56$, closely matching the human score. Holistic methods conflate completeness with coherence, assigning the minimum score.
(b) Consistency: Subtle factual errors in a plausible summary
Source (excerpt): “The Pentagon called the intercept unsafe and unprofessional … The Russian Defense Ministry said the jet was scrambled to identify the aircraft…”
Summary: “The U.S. RC-135U was flying over the Baltic Sea when it was intercepted by a Russian SU-27 Flanker. The Pentagon said the Russian jet flew around the plane to identify it. Read more: http://dailycaller.com/…”
Decomposed reasoning: N Q1 (misattributes Russia’s purpose to Pentagon); N Q2 (fabricated URL); N Q3 (wrong entity role); Y Q4 (aircraft types correct); N Q5 (conflates Pentagon/Russian accounts); Y Q6–7 (core event described).
Insight: The summary mentions correct entities (RC-135U, SU-27) and describes the real event, so holistic methods see it as consistent. BinEval’s decomposed questions probe each claim independently, catching the misattribution (N on Q1, Q3), fabricated URL (N on Q2), and conflated accounts (N on Q5). Score: $3/7 \to 1.57$, close to human 2.0.
(c) Fluency: Garbled summary with partial readability
Summary: “ ‘Space invaders’ was developed in japan back in 1970. Japanese can sleep soundly in their beds tonight as government’s top military official. He also fought muhammad ali in 1976. Inoki has appeared in the u.s.-based wwe.”
Decomposed reasoning: N Q1 (sentence fragment: “as government’s top military official”); Y Q2 (no spelling errors); N Q3 (punctuation: backtick quotes, missing caps); N Q4 (imprecise: “1970” vs “late 1970s”); N Q5 (run-on fragment in sentence 2); N Q6 (unnatural jumps between topics); N Q7 (requires re-reading); Y Q8 (main points still comprehensible).
Insight: Human annotators rate this 2/3 (not the worst) because the text is partially readable despite errors. BinEval captures this nuance: Q2 (Y, no spelling errors) and Q8 (Y, comprehensible gist) prevent a floor score. Score: $2/8 \to 1.50$, between 1 and 2. G-Eval and UniEval assign the minimum because any fluency issue triggers a blanket negative judgment.
(d) Relevance: Concise but topically relevant one-liner
Source (excerpt): “ISIS released more than 200 Yazidis… mostly women, children, and elderly… A senior Peshmerga commander said they were released in groups… The freed captives appeared very tired…”
Summary: “ISIS released over 200 Yazidis on Wednesday.”
Decomposed reasoning: N Q1 (omits key details: demographics, conditions); Y Q2 (no fabricated content); Y Q3 (no redundancy); Y Q4 (no trivial padding); N Q5 (too sparse, misses important aspects); Y Q6 (included content is relevant).
Insight: The summary captures the central event accurately but is too brief. BinEval rewards what it does right (on-topic, no fabrication, no padding) while penalising omissions (Q1, Q5). Score: $4/6 \to 3.67$, matching human 3.33. G-Eval and UniEval again conflate incompleteness with irrelevance, assigning the minimum score despite the summary being factually on-topic.

Figure 5: Four illustrative SummEval examples, one per evaluation dimension. In each case, BinEval’s question decomposition produces scores closely aligned with human judgments by independently assessing multiple quality facets. Holistic methods (G-Eval, UniEval with gpt-oss) collapse to extreme scores on edge cases (short-but-correct summaries, partially readable text, or concise one-liners) because a single judgment conflates orthogonal quality dimensions.

| Method | (a) Coherence | (b) Consistency | (c) Fluency | (d) Relevance |
|---|---|---|---|---|
| Human | 4.67 | 2.0 | 2.0 | 3.33 |
| BinEval (Cl.) | 4.56 | 1.57 | 1.5 | 3.67 |
| BinEval (gpt-oss) | 5.0 | 3.0 | 1.0 | 4.33 |
| G-Eval (GPT-4) | 3.5 | 4.0 | 2.0 | 3.5 |
| G-Eval (gpt-oss) | 1.0 | 5.0 | 1.0 | 1.0 |
| UniEval (T5) | 4.2 | 4.6 | 2.5 | 3.8 |
| UniEval (gpt-oss) | 1.0 | 5.0 | 1.0 | 1.0 |

Figure 6: Score comparisons for four illustrative SummEval examples, one per dimension (bar charts; score axis 0–5 for coherence, consistency, and relevance, and 0–3 for fluency). Dashed line marks the human reference. BinEval (Claude) consistently tracks human scores across all dimensions. G-Eval (GPT-4) and UniEval (T5), the published baselines, perform reasonably, but when their evaluation formats are applied to gpt-oss without Monte Carlo sampling or fine-tuning, scores collapse on edge cases.
A.2 Prompt Evolution: Illustrative Examples
This section illustrates how BinEval’s iterative prompt update modifies evaluation and generation prompts across iterations, with examples of both successful updates and failure modes.
A.2.1 Example 1: Self-Update on Coherence (SummEval)
Result: Spearman $\rho$ improved from .521 (baseline) to .610 (iteration 1).
The self-update pipeline identified that the baseline coherence prompt was too strict on single-sentence summaries and penalized omission of background details, while human annotators focused primarily on logical flow. Three representative lessons were extracted:
Extracted Lessons (Self-Update, Coherence)
1. Implicit transitions are acceptable. Require logical connections but do not demand explicit cue words (“because,” “therefore”). Implicit continuity suffices if the narrative flows.
2. Add a central-claim relevance criterion. Each sentence should advance the article’s main claim; sentences that do not contribute are non-contributory regardless of grammatical correctness.
3. Do not penalize omission of background details. Missing context should not lower coherence as long as the core fact and conflict remain clear.
These lessons produced targeted edits to the evaluation rubric:
Table 5: Coherence prompt: key changes from iteration 0 to iteration 1.
Why it works: The lessons correctly identified a systematic bias (the model over-penalized brevity), and the updated rubric explicitly instructs the evaluator to tolerate omissions while adding a concrete “central claim” criterion that better aligns with how human annotators judge coherence.
A.2.2 Example 2: Cross-Model Update on Consistency (SummEval)
Result: Spearman $\rho$ improved from .501 (baseline) to .637 (iteration 1).
Claude (source evaluator) correctly distinguished between omission (not mentioning a source fact) and contradiction (stating something unsupported). gpt-oss (target) conflated these, penalizing summaries that simply omitted details. Key disagreement-driven lessons:
Extracted Lessons (Cross-Model, Consistency)
1. Omission $\neq$ inconsistency. A summary that omits details from the source is not factually inconsistent; only statements present in the summary that are unsupported should be penalized.
2. Semantic equivalence via arithmetic. Converting “83rd minute” to “seven minutes remaining” (in a 90-minute match) is a valid transformation, not a hallucination.
3. Subject–role misattribution. When summaries restructure clauses, verify that entities are attached to the correct verbs (e.g., “X restarted his row with Z” misattributes if the source says “X had a row with Y and drew 0–0 with Z”).
The updated prompt grew substantially (from 4 evaluation steps to 6, with detailed guidance on literal interpretation, subject verification, and semantic equivalence). The critical structural addition:
Added to Evaluation Steps (cross-model, iteration 1): “For each statement in the summary, check whether it is supported by the article. The summary does not need to cover all details from the article. Omitting information is not a factual error. Only flag statements that are present in the summary but are unsupported or contradicted.”
Why it works: The cross-model signal pinpointed a fundamental conceptual error (conflating omission with contradiction) that human-score divergence alone could not have surfaced so clearly. The +.136 improvement, the largest in our experiments, demonstrates that inter-model disagreement can identify systematic evaluation biases that self-reflection misses.
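The disagreement signal behind these lessons can be sketched as a simple comparison of question-level verdicts from the two evaluators. The data layout and function names below are hypothetical, and the verdicts are toy values; the sketch only shows how items with at least one differing answer might be collected before they are handed to a note-taker.

```python
def question_disagreements(src: dict[str, bool], tgt: dict[str, bool]) -> list[str]:
    """Return the questions that the two evaluators answered differently."""
    return [q for q in src if q in tgt and src[q] != tgt[q]]


def collect_for_note_taker(items: list[dict]) -> list[dict]:
    """Keep only items with at least one disagreement, with the differing
    questions attached for later lesson extraction."""
    batch = []
    for item in items:
        delta = question_disagreements(item["src"], item["tgt"])
        if delta:
            batch.append({"id": item["id"], "disagreements": delta})
    return batch


# Toy verdicts: "src" is the source evaluator, "tgt" the target evaluator.
items = [
    {"id": "summary-1",
     "src": {"Q1": True, "Q2": True, "Q3": False},
     "tgt": {"Q1": False, "Q2": True, "Q3": False}},
    {"id": "summary-2",
     "src": {"Q1": True, "Q2": True, "Q3": True},
     "tgt": {"Q1": True, "Q2": True, "Q3": True}},
]

print(collect_for_note_taker(items))
```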
A.2.3 Example 3: Failure Case, Relevance (SummEval)
Result: Spearman $\rho$ decreased from .505 to .357 after applying lessons.
The self-update pipeline correctly diagnosed that the model was too lenient on relevance, giving perfect scores to summaries that captured the headline fact but omitted key actors and motivations. However, the fix made the prompt too strict:
Extracted Lessons (Self-Update, Relevance; Led to Degradation)
1. Make the rubric stricter about coverage of essential context, not just the headline fact.
2. Require the evaluator to check for every key actor, every motivation, and every background event.
3. Apply quantitative penalties: $-1$ per missing key actor, $-0.5$ per missing motivation or background event.
The resulting prompt decomposed relevance into exhaustive sub-criteria (actors, motivations, background events, factual propositions, redundancy) with a rigid penalty system. The regenerated binary questions reflected this over-specificity:
Regenerated questions (relevance, failed iteration):
1. Does the summary include every key actor mentioned in the source?
2. Does the summary include every motivation for actions stated in the source?
3. Does the summary include all background events directly relevant to the headline?
4. Does the summary contain every other factual proposition (dates, locations, amounts)?
5. Does the summary avoid irrelevant or redundant information?
Why it fails: Human annotators use a holistic judgment for relevance (“did the summary capture the gist?”) with soft tolerance for missing minor details. The updated questions demand exhaustive coverage, causing the model to rate almost all summaries as deficient. The resulting scores are systematically lower than human scores, destroying rank correlation. This illustrates a fundamental limitation: when the human evaluation criterion is inherently holistic and tolerant, decomposing it into strict atomic checks produces a harsher evaluator that diverges from human behavior.
A.2.4 Example 4: IFBench, Promptable vs. Computational Constraints
Result: Format accuracy improved from 52% to 69%; count accuracy degraded from 63% to 31%.
The IFBench meta prompt starts minimal (22 characters: “Respond to the query.”). As shown in Table 6, after 4 iterations of lesson extraction and prompt rewriting, it grows to 6,248 characters. The lessons fall into two categories:
Promptable lessons (effective).
For format and sentence constraints, lessons identify missing guidance that the model can follow:
IFBench: Effective Lessons (Format/Sentence)
- “Output must be plain text with no markup unless explicitly required.”
- “For repeat-type tasks, output ONLY the exact original request with the specified minimal change. Do not add explanations.”
- “Obey the requested format exactly. Every line must follow the structure (indentation, list marker, newline) as described.”
Computational lessons (ineffective).
For count and ratio constraints, lessons correctly diagnose the problem but prescribe unactionable instructions:
IFBench: Ineffective Lessons (Count/Ratio)
- “Maintain a running counter for each required element and stop when the target count is reached.” ← model cannot execute
- “Programmatically verify the position of each token; if the count is wrong, rewrite until satisfied.” ← requires self-verification loop
- “Construct a structural outline and reference it when placing required words.” ← implicit reasoning, not enforced
The accumulation of these unactionable instructions causes prompt bloat:
Table 6: IFBench meta prompt growth and its effect on accuracy.
Insight: At iteration 3, the prompt is large enough to contain useful format guidance but not yet so bloated that attention competition degrades all categories. By iteration 4, the accumulated computational instructions (which the model cannot follow) create noise that interferes with previously-working format guidance, causing a collapse across all categories. This reveals a carrying capacity for prompt-based optimization: beyond a critical prompt length, additional instructions become counterproductive regardless of their correctness.
Appendix B Experimental Setups
SummEval: Evaluator Prompt Optimization. We optimize the evaluator prompt for gpt-oss-120b on all four SummEval dimensions: coherence, consistency, fluency, and relevance. SummEval contains 1,600 items (100 documents $\times$ 16 summarization systems) with human Likert ratings on a 1–5 scale.
- Data split. We randomly sample 10 items per system (seed $=42$), yielding 160 development items for lesson extraction and 1,440 test items for evaluation. The development set spans 82 of the 100 documents, providing broad coverage while keeping the update loop manageable.
- Models. The target evaluator is gpt-oss-120b with temperature 0. For self-update, the same model also serves as the note-taker for lesson extraction, semantic deduplication, and prompt rewriting. For cross-model update, Claude Sonnet 4 serves as both the source evaluator and the note-taker, again with temperature 0.
- Procedure. Each iteration follows Algorithm 1: (1) evaluate the development and test sets with the current prompt and binary questions; (2) for self-update, identify items where the model score diverges most from the human score ($|s_{\mathrm{model}}-s_{\mathrm{human}}|>0.3$ after normalization), while for cross-model update we identify question-level disagreements between the source and target evaluators; (3) extract lessons from these failures or disagreements in batches, semantically deduplicate them with an LLM, and retain up to 10 unique lessons; (4) rewrite the evaluator prompt in a single LLM call incorporating all retained lessons; and (5) regenerate binary questions from the updated prompt for self-update.
- Early stopping. We run up to 5 iterations and stop when the test-set Spearman $\rho$ decreases relative to the previous iteration.
- Metric. We report pooled Spearman rank correlation across all test items rather than a per-document average, since per-document averaging discards documents with fewer than two systems in the sparse development split.
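The data split and the self-update failure filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: the item layout, function names, and min-max normalization of 1–5 scores to [0, 1] are hypothetical, and a seed of 42 with Python's `random` module will not necessarily reproduce the paper's exact sample.

```python
import random
from collections import defaultdict


def split_dev_test(items, per_system=10, seed=42):
    """Sample `per_system` items from each system for dev; the rest is test."""
    rng = random.Random(seed)
    by_system = defaultdict(list)
    for item in items:
        by_system[item["system"]].append(item)
    dev = []
    for system in sorted(by_system):
        dev.extend(rng.sample(by_system[system], per_system))
    dev_ids = {item["id"] for item in dev}
    test = [item for item in items if item["id"] not in dev_ids]
    return dev, test


def normalize(score, low=1.0, high=5.0):
    """Min-max normalize a Likert score to [0, 1] (assumed normalization)."""
    return (score - low) / (high - low)


def divergent(items, threshold=0.3):
    """Items whose normalized model and human scores differ by more than threshold."""
    return [it for it in items
            if abs(normalize(it["model"]) - normalize(it["human"])) > threshold]


# 100 documents x 16 systems, mirroring the SummEval layout.
items = [{"id": f"d{d}-s{s}", "doc": d, "system": s}
         for d in range(100) for s in range(16)]
dev, test = split_dev_test(items)
print(len(dev), len(test))

# Toy scored items: only the first exceeds the 0.3 divergence threshold.
scored = [{"id": "a", "model": 5.0, "human": 2.0},
          {"id": "b", "model": 4.0, "human": 3.5}]
print([it["id"] for it in divergent(scored)])
```

Sampling per system rather than per document keeps every summarization system represented in the development set, at the cost of leaving some documents with only one development item.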
IFBench: Generation Prompt Optimization. We optimize the generation meta-prompt for gpt-oss-120b on IFBench, an instruction-following benchmark with 290 test cases spanning 56 constraint types across 7 categories: count, words, format, ratio, sentence, repeat, and custom. Each case includes a programmatic verification function.
- Data split. The development set contains 56 samples, one per constraint type, preferring previously failed cases; the test set contains the remaining 238 samples.
- Models. The generator is gpt-oss-120b with temperature 0. For self-update, the judge is also gpt-oss-120b. For cross-model update, the judge is Claude Sonnet 4.
- Binary question decomposition. For each development sample, we convert the IFBench constraint specification into a natural-language description and decompose it into binary yes/no questions using the BinEval meta-prompt. For example, a constraint requiring one occurrence of “door” and two occurrences of “bread” becomes questions such as whether the response includes “door” exactly once and “bread” exactly twice.
- Procedure. Each iteration: (1) generate responses on all 290 samples with the current meta-prompt; (2) evaluate the development responses with an LLM judge using binary questions; (3) extract and deduplicate lessons from development failures; (4) rewrite the generation meta-prompt; and (5) evaluate on the test set using the official IFBench verification functions in strict mode.
- Iterations. Self-update runs for 5 iterations. Cross-model update stops after 2 iterations because test accuracy decreases.
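The door/bread example above can be illustrated with a small sketch that turns a word-count constraint into binary questions and checks a response against them. Everything here is a stand-in: in the setup described above an LLM judge answers the binary questions on the development set and the official IFBench verification functions score the test set, whereas this sketch uses a hypothetical regex-based counter with case-insensitive whole-word matching.

```python
import re


def count_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))


def to_binary_questions(required: dict[str, int]) -> list[str]:
    """Phrase each required word count as a yes/no question."""
    return [f"Does the response include the word '{w}' exactly {n} time(s)?"
            for w, n in required.items()]


def check(text: str, required: dict[str, int]) -> dict[str, bool]:
    """Answer each question programmatically: True means the count matches."""
    return {w: count_word(text, w) == n for w, n in required.items()}


required = {"door": 1, "bread": 2}
response = "She left fresh bread by the door, then baked more bread."

for question in to_binary_questions(required):
    print(question)
print(check(response, required))
```

Because each word gets its own question, a failure on "bread" but not "door" points directly at the part of the constraint the generator missed.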
Appendix C Automatic Prompt Update Algorithm
Algorithm 1: Iterative Prompt Update via Binary Question Disagreement
Input: Source evaluator $E_{\mathrm{src}}$, target evaluator $E_{\mathrm{tgt}}$ with initial prompt $P_{E}^{(0)}$, binary questions $Q=\{q_{1},\ldots,q_{N}\}$, test data $\{(x_{j},y_{j})\}_{j=1}^{J}$, note-taker LLM $L_{\mathrm{note}}$, updater LLM $L_{\mathrm{update}}$, tolerance $\epsilon$, max iterations $T$
Output: Updated prompt $P_{E}^{(T)}$
for $t\leftarrow 1$ to $T$ do
  // Step 1: Evaluate with both models
  foreach $(x_{j},y_{j})$ in test data do
    $A_{j}^{\mathrm{src}}\leftarrow\{f_{E_{\mathrm{src}}}(x_{j},y_{j},q_{i})\}_{i=1}^{N}$
    $A_{j}^{\mathrm{tgt}}\leftarrow\{f_{E_{\mathrm{tgt}}}(x_{j},y_{j},q_{i};P_{E}^{(t-1)})\}_{i=1}^{N}$
  end
  // Check convergence
  foreach dimension $d$ in $D$ do
    $S_{d}^{\mathrm{tgt}}\leftarrow\frac{1}{|Q_{d}|}\sum_{q_{i}\in Q_{d}}\mathrm{mean}_{j}[A_{j}^{\mathrm{tgt}}(q_{i})]$
    $S_{d}^{\mathrm{src}}\leftarrow\frac{1}{|Q_{d}|}\sum_{q_{i}\in Q_{d}}\mathrm{mean}_{j}[A_{j}^{\mathrm{src}}(q_{i})]$
  end
  if $|S_{d}^{\mathrm{tgt}}-S_{d}^{\mathrm{src}}|<\epsilon$ for all $d$ then
    return $P_{E}^{(t-1)}$ // Converged
  end
  // Step 2: Identify disagreements
  foreach $(x_{j},y_{j})$ in test data do
    $\Delta_{j}\leftarrow\{q_{i}:A_{j}^{\mathrm{src}}(q_{i})\neq A_{j}^{\mathrm{tgt}}(q_{i})\}$
  end
  // Step 3: Extract lessons from disagreements
  $L_{\mathrm{all}}\leftarrow$ empty list
  foreach $j$ where $|\Delta_{j}|>0$ do
    $L_{j}\leftarrow L_{\mathrm{note}}(x_{j},y_{j},A_{j}^{\mathrm{src}},A_{j}^{\mathrm{tgt}},\Delta_{j})$
    $L_{\mathrm{all}}\leftarrow L_{\mathrm{all}}+L_{j}$
  end
  // Step 4: Semantic deduplication
  $M\leftarrow$ empty list // Lesson memory
  foreach $l_{\mathrm{new}}$ in $L_{\mathrm{all}}$ do
    $(\mathrm{is\_dup},\mathrm{merge\_idx},\mathrm{merged})\leftarrow\mathrm{Dedup\_LLM}(l_{\mathrm{new}},M)$
    if $\mathrm{is\_dup}$ then
      $M[\mathrm{merge\_idx}]\leftarrow\mathrm{merged}$
    else
      $M.\mathrm{append}(l_{\mathrm{new}})$
    end
  end
  $L_{\mathrm{unique}}\leftarrow M$
  // Step 5: Update target evaluator prompt
  $P_{E}^{(t)}\leftarrow P_{E}^{(t-1)}$
  foreach $l_{k}$ in $L_{\mathrm{unique}}$ do
    $(s_{k},s_{k}^{\prime})\leftarrow L_{\mathrm{update}}(P_{E}^{(t)},l_{k})$
    $P_{E}^{(t)}\leftarrow P_{E}^{(t)}.\mathrm{replace}(s_{k},s_{k}^{\prime})$
  end
end
return $P_{E}^{(T)}$
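The control flow of Algorithm 1 can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: every LLM call (source and target evaluators, note-taker, deduplicator, updater) is passed in as a plain callable, the dimension-to-question mapping `dims` plays the role of $Q_{d}$, and the toy stand-ins and default tolerance at the bottom are hypothetical.

```python
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


def dimension_scores(answers, dims):
    """answers: one {question: 0/1} dict per example; dims: {dimension: [questions]}."""
    return {d: mean(mean(a[q] for a in answers) for q in qs) for d, qs in dims.items()}


def update_prompt(prompt, data, dims, eval_src, eval_tgt,
                  note_taker, dedup, updater, eps=0.05, max_iters=5):
    questions = [q for qs in dims.values() for q in qs]
    for _ in range(max_iters):
        # Step 1: answer every binary question with both evaluators.
        src = [{q: eval_src(x, y, q) for q in questions} for x, y in data]
        tgt = [{q: eval_tgt(x, y, q, prompt) for q in questions} for x, y in data]
        # Convergence check: every dimension score within eps of the source.
        s_src, s_tgt = dimension_scores(src, dims), dimension_scores(tgt, dims)
        if all(abs(s_tgt[d] - s_src[d]) < eps for d in dims):
            return prompt
        # Steps 2-3: collect lessons from examples with any disagreement.
        lessons = []
        for (x, y), a_src, a_tgt in zip(data, src, tgt):
            delta = [q for q in questions if a_src[q] != a_tgt[q]]
            if delta:
                lessons.extend(note_taker(x, y, a_src, a_tgt, delta))
        # Step 4: semantic deduplication into a lesson memory.
        memory = []
        for lesson in lessons:
            is_dup, idx, merged = dedup(lesson, memory)
            if is_dup:
                memory[idx] = merged
            else:
                memory.append(lesson)
        # Step 5: one targeted string replacement per unique lesson.
        for lesson in memory:
            old, new = updater(prompt, lesson)
            prompt = prompt.replace(old, new)
    return prompt


# Toy stand-ins for the LLM calls (all hypothetical).
def eval_src(x, y, q):
    return 0 if "2020" in y else 1


def eval_tgt(x, y, q, prompt):
    return eval_src(x, y, q) if "check dates" in prompt else 1


def note_taker(x, y, a_src, a_tgt, delta):
    return ["check dates"]


def dedup(lesson, memory):
    if lesson in memory:
        return True, memory.index(lesson), lesson
    return False, -1, lesson


def updater(prompt, lesson):
    span = "Answer yes or no."
    return span, f"{span} Always {lesson}."


data = [("article A", "summary citing 2020"), ("article B", "plain summary")]
dims = {"consistency": ["Are all dates in the summary supported by the source?"]}
final = update_prompt("You are a judge. Answer yes or no.", data, dims,
                      eval_src, eval_tgt, note_taker, dedup, updater)
print(final)
```

In the toy run, the target evaluator disagrees with the source on one example in the first iteration, a single lesson is written into the prompt, and the second iteration passes the convergence check and returns the updated prompt.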
Appendix D Results
D.1 BinEval Evaluation Results on Topical-Chat
Table 7 establishes four main findings. First, BinEval (Claude) is the strongest overall method on Topical-Chat, with the best average Spearman and Kendall correlations (0.632 / 0.525), outperforming G-Eval (GPT-4), UniEval (T5), and all lexical baselines. This indicates that multi-question binary decomposition is a strong evaluation paradigm for dialogue, where quality depends on several partially independent criteria rather than a single aggregate impression.
Second, different methods are strongest on different dimensions. BinEval (Claude) performs best on the most subjective dimensions, naturalness and engagingness, improving over G-Eval (GPT-4) by 0.137 and 0.113 in Spearman correlation. By contrast, UniEval (T5) and G-Eval remain stronger on coherence, where a more holistic representation may better capture global logical flow. Groundedness is comparatively method-agnostic: all LLM-based evaluators are within a narrow range, suggesting that this dimension is easier to capture regardless of evaluation format.
Third, evaluator quality is a first-order factor. BinEval (gpt-oss) reaches 0.539 / 0.450 on average, close to G-Eval (gpt-oss) at 0.541 / 0.478 and far above UniEval (gpt-oss) at 0.144 / 0.132, but still well below the Claude-based version. In other words, good question decomposition helps, but the evaluator must still be capable of answering conversational questions with enough nuance. This is especially clear for UniEval (gpt-oss): a single binary question often collapses to nearly constant outputs, such as naturalness at 0.
Fourth, question design is helpful but bounded. BinEval (gpt-oss) remains competitive because multiple binary questions create useful score granularity even when single-score calibration is weak, but it still does not match the Claude-based evaluator. Overall, the Topical-Chat results show that decomposition is particularly valuable for subjective dialogue qualities, while still depending on evaluator strength for best performance.
The violin plots reinforce these trends. In Figure 7, BinEval (Claude) most closely matches the human spread and skew across all four dimensions, preserving both the broader dispersion on engagingness and the more concentrated but still non-degenerate distributions on naturalness, coherence, and groundedness. UniEval (T5) remains high but noticeably compressed, with a pronounced ceiling effect on naturalness, coherence, and groundedness and much weaker alignment on engagingness. Among the gpt-oss-based evaluators, BinEval (gpt-oss) is more conservative than the Claude version but still retains meaningful variation across examples, whereas G-Eval (gpt-oss) is more compressed and UniEval (gpt-oss) is nearly degenerate across all dimensions, providing very little discrimination.
Table 7: Turn-level Spearman $\rho$ / Kendall $\tau$ correlations on Topical-Chat.
Figure 7: Per-dimension score distributions on Topical-Chat. BinEval (Claude) most closely tracks the human distributions across naturalness, coherence, engagingness, and groundedness. UniEval (T5) exhibits clear ceiling effects, especially outside engagingness; BinEval (gpt-oss) remains more discriminative than the other gpt-oss-based baselines; and UniEval (gpt-oss) is nearly flat across dimensions.
Figure 8: Per-system score distributions on Topical-Chat. BinEval (Claude) best preserves the human ordering of systems while maintaining realistic within-system spread. BinEval (gpt-oss) follows the broad ranking but is more conservative in absolute score level, G-Eval (gpt-oss) compresses low- and mid-performing systems, and UniEval (gpt-oss) is nearly uninformative because its scores are almost constant across systems.
Figure 8 shows the same pattern at the system level. BinEval (Claude) best preserves the human ordering of systems, separating stronger systems from weaker ones while keeping realistic within-system variation rather than collapsing all outputs into a narrow high-scoring band. BinEval (gpt-oss) also tracks the broad ranking but with lower absolute scores, suggesting that decomposition still helps even when the underlying evaluator is weaker. By contrast, G-Eval (gpt-oss) compresses much of the low-to-mid range, and UniEval (gpt-oss) is nearly flat across systems. Together, these plots illustrate the central advantage of decomposition: multiple targeted questions produce more realistic and discriminative score variation than a single holistic or near-Boolean judgment.
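The granularity argument can be made concrete with a small sketch. It assumes the dimension score is the unweighted fraction of "yes" verdicts, as in the $S_{d}$ computation of Algorithm 1; the function names and the question counts used here are illustrative only.

```python
from itertools import product


def aggregate(verdicts):
    """Fraction of 'yes' verdicts (1 or True) among the binary answers."""
    return sum(verdicts) / len(verdicts)


def distinct_levels(num_questions):
    """All score values reachable by averaging `num_questions` binary verdicts."""
    return sorted({aggregate(v) for v in product([0, 1], repeat=num_questions)})


print(distinct_levels(1))       # a single binary question
print(distinct_levels(4))       # four binary questions
print(aggregate([1, 1, 0, 1]))  # one failed question out of four
```

A single binary question can only place an output at one of two score levels, whereas averaging N questions yields up to N + 1 levels, so borderline outputs that fail one or two questions can be separated from clearly flawed ones.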
D.2 BinEval Evaluation Results on QAGS
Table 8: Correlation results on QAGS. Pearson $r$ / Spearman $\rho$ / Kendall $\tau$ for QAGS-CNN, QAGS-XSUM, and their average.
Table 8 highlights the setting where question decomposition helps most. BinEval (Claude) is the strongest method overall, with the best average Pearson, Spearman, and Kendall correlations (0.604 / 0.620 / 0.534). It is strongest on rank-based metrics for both datasets, achieving the top Spearman values on CNN/DM and XSum (0.702 and 0.539), while remaining competitive in Pearson correlation against stronger regression-style baselines such as BARTScore on CNN/DM and G-Eval (GPT-4) on XSum. BinEval (gpt-oss) is also robust, reaching 0.543 / 0.563 / 0.492 on average and substantially outperforming the other gpt-oss-based evaluators.
The dataset-level breakdown is also informative. CNN/DM is the easier split: most strong evaluators achieve reasonably high correlations, and both BinEval variants perform well there, with BinEval (Claude) at 0.665 / 0.702 / 0.597 and BinEval (gpt-oss) at 0.651 / 0.642 / 0.551. XSum is harder for every method, but the relative pattern remains the same: BinEval (Claude) still gives the best Spearman correlation (0.539), narrowly ahead of G-Eval (GPT-4) at 0.537, while BinEval (gpt-oss) remains competitive at 0.483. The additional gpt-oss baselines make the advantage of decomposition especially clear. G-Eval (gpt-oss) nearly collapses on QAGS, reaching only 0.140 / 0.132 / 0.131 on average, and UniEval (gpt-oss) recovers some signal because factual consistency is closer to a binary property, but still trails both BinEval variants. In short, QAGS shows that decomposition is most valuable when a single holistic prompt fails to preserve enough ranking granularity.
Figure 9 supports the same conclusion in distributional form. For both CNN/DM and XSum, the human ratings are distinctly bimodal, with substantial mass near both 0 and 1. BinEval (Claude) is the method that most clearly preserves this structure: it keeps broad support across the full range instead of collapsing toward the top of the scale. BinEval (gpt-oss) is somewhat more conservative but still retains visible spread and separation. By contrast, UniEval (T5) is strongly overconfident on both datasets, with most of its mass concentrated near high scores, while G-Eval (gpt-oss) and UniEval (gpt-oss) become almost binary in the wrong way: they place much of the distribution at the extremes with very limited intermediate variation. This matters because a useful factual evaluator must distinguish mildly flawed summaries from clearly inconsistent ones, not just separate obviously correct cases from obviously incorrect ones.
Figure 9: Per-dataset score distributions on QAGS for human ratings, BinEval, and UniEval.
Figure 10 makes the ranking behavior even clearer. BinEval (Claude) shows the cleanest positive trend on both datasets, with fitted lines that track the diagonal substantially better than the other methods. BinEval (gpt-oss) follows the same pattern, though with more dispersion, matching its strong but slightly lower correlations in Table 8. UniEval (T5) produces a positive trend but compresses many predictions into a narrow upper band, which limits discrimination despite decent correlation. G-Eval (gpt-oss) is nearly flat on CNN/DM and only weakly increasing on XSum, while UniEval (gpt-oss) exhibits only coarse, quantized outputs. Together, the table and figures show that the key benefit of decomposition on QAGS is not only better correlation, but also better use of the score range: BinEval assigns meaningfully different scores to different kinds of factual errors instead of collapsing them into a small set of near-identical predictions.
Figure 10: Per-summary scatter plots against human consistency scores on QAGS.
Appendix E Binary Questions for SummEval
Tables 9–12 list the binary questions auto-generated by BinEval for each SummEval evaluation dimension. These are the questions referenced in Section 5.6 and Figure 3. Each question is designed so that “yes” indicates the output satisfies the criterion and “no” indicates a violation.
Table 9: Binary questions for Coherence on SummEval (8 questions).
Table 10: Binary questions for Consistency on SummEval (7 questions).
Table 11: Binary questions for Fluency on SummEval (7 questions).
Table 12: Binary questions for Relevance on SummEval (5 questions).