@HamelHusain: Yes! binary judges are far more practical for most people, because likert scales (or scores) have too many footguns All…
Summary
Hamel Husain shares flashcards and insights from an AI evaluation course, advocating for binary judges over Likert scales for practical LLM evaluation.
Cached at: 06/29/26, 04:23 AM
Yes! Binary judges are far more practical for most people, because Likert scales (or scores) have too many footguns.
All the flashcards are here (inspired by @chrisalbon's flashcards): https://t.co/qfB4WJgX5n https://t.co/OvSdVi5rbB
AI Eval Flashcards | Hamel Husain & Shreya Shankar on Maven
Source: https://maven.com/parlance-labs/o/540bd8 Digital asset
Hamel Husain
ML Engineer with 20 years of experience

Shreya Shankar
ML Systems & Applied AI Evals Researcher
12 pages with visual bite-sized takeaways of the most important learnings from our course.
elvis (@omarsar0): If you use LLM-as-judge, this one is worth reading.
(bookmark it)
It’s actually one of the most effective ways to use LLM-as-a-Judge for evals.
Holistic judge scores hide both their reasoning and their ceiling effects.
BINEVAL decomposes each evaluation criterion into atomic
Similar Articles
@omarsar0: If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use L…
BinEval is a new framework that decomposes LLM evaluation criteria into atomic binary questions, improving interpretability and enabling targeted prompt optimization, achieving strong results on factual consistency benchmarks.
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations, finding that pairwise preferences flip 13.6% of the time on average, with significant first-position bias in GPT-4o-mini, and recommends multi-trial aggregation and position randomization.
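The two recommendations in that summary, multi-trial aggregation and position randomization, can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the `judge` callable, its "first"/"second" return convention, and the `aggregate_pairwise` helper are all hypothetical names assumed for the example.

```python
import random
from collections import Counter
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes the two answers in presentation order, returns "first" or "second".
Judge = Callable[[str, str], str]


def aggregate_pairwise(
    judge: Judge, answer_a: str, answer_b: str, trials: int = 5, seed: int = 0
) -> Tuple[str, Counter]:
    """Run the judge several times with randomized order and majority-vote."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(trials):
        # Randomize which answer is shown first to dilute position bias.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        picked_first = judge(first, second) == "first"
        # Map the positional verdict back to the underlying answer.
        winner = "B" if picked_first == swapped else "A"
        votes[winner] += 1
    # Use an odd number of trials so a two-way vote cannot tie.
    return votes.most_common(1)[0][0], votes


# Stand-in judge that prefers the longer answer, regardless of position.
def toy_judge(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


winner, votes = aggregate_pairwise(toy_judge, "a longer answer", "short")
```

With a real LLM judge, the vote split in `votes` is also a cheap reliability signal: a pair that splits 3 to 2 deserves less trust than one that goes 5 to 0.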
LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.
Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
This paper explores which agreement statistics for LLM judge validation are redundant when criteria are binary, and provides a checklist for proper reporting including abstention handling.
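As a rough illustration of validating a binary judge against human labels, the sketch below computes raw agreement and Cohen's kappa, and reports abstentions separately instead of silently dropping them. It is not the paper's checklist: the `binary_agreement` helper and the convention that `None` means "judge abstained" are assumptions made for the example.

```python
import math
from typing import Dict, List, Optional


def binary_agreement(
    human: List[bool], judge: List[Optional[bool]]
) -> Dict[str, float]:
    """Agreement between human labels and a binary judge.

    Assumption: a judge verdict of None means the judge abstained.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    # Score only the items where the judge gave a verdict.
    pairs = [(h, j) for h, j in zip(human, judge) if j is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("judge abstained on every item")
    observed = sum(h == j for h, j in pairs) / n
    p_human = sum(h for h, _ in pairs) / n  # share of human "pass" labels
    p_judge = sum(j for _, j in pairs) / n  # share of judge "pass" labels
    # Agreement expected by chance if the two raters were independent.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Kappa is undefined when chance agreement is already 1.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else math.nan
    return {
        "n_scored": n,
        "abstention_rate": (len(human) - n) / len(human),
        "raw_agreement": observed,
        "cohen_kappa": kappa,
    }


human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, False, True, None]
report = binary_agreement(human_labels, judge_labels)
```

Reporting the abstention rate next to the agreement numbers matters because a judge that abstains on the hard cases can look more accurate than it is.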