@HamelHusain: Yes! binary judges are far more practical for most people, because likert scales (or scores) have too many footguns All…

X AI KOLs Timeline 06/28/26, 01:52 AM Tools

ai-evals llm-judge binary-judges evaluation flashcards llm-evaluation

Summary

Hamel Husain shares flashcards and insights from an AI evaluation course, advocating for binary judges over Likert scales for practical LLM evaluation.

Yes! Binary judges are far more practical for most people, because Likert scales (or scores) have too many footguns. All the flashcards are here (inspired by @chrisalbon's flashcards): https://t.co/qfB4WJgX5n https://t.co/OvSdVi5rbB
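To make the binary-judge idea concrete, here is a minimal sketch in Python. It is not taken from the flashcards or the course. The names `parse_verdict`, `pass_rate`, and `stub_judge` are hypothetical, and `stub_judge` is a hard-coded stand-in for a real LLM call. The point is only that a PASS/FAIL verdict parses strictly and aggregates into a single pass rate.

```python
from typing import Callable, Iterable


def parse_verdict(raw: str) -> bool:
    """Map a judge's raw reply to True (pass) or False (fail); reject anything else."""
    verdict = raw.strip().upper()
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    raise ValueError(f"unparseable verdict: {raw!r}")


def pass_rate(outputs: Iterable[str], judge: Callable[[str], str]) -> float:
    """Fraction of outputs the judge marks as PASS."""
    verdicts = [parse_verdict(judge(output)) for output in outputs]
    if not verdicts:
        raise ValueError("no outputs to judge")
    return sum(verdicts) / len(verdicts)


def stub_judge(output: str) -> str:
    # Hypothetical stand-in for an LLM call: passes outputs that cite a source.
    return "PASS" if "[source]" in output else "FAIL"


sample = ["Paris is the capital [source]", "Trust me", "See [source]", "No idea"]
print(pass_rate(sample, stub_judge))
```

In practice `stub_judge` would be replaced by a function that sends a single yes/no criterion to a model. Rejecting any reply other than PASS or FAIL keeps malformed judge output from silently entering the metric.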

Original Article

Cached at: 06/29/26, 04:23 AM

AI Eval Flashcards | Hamel Husain & Shreya Shankar on Maven

Source: https://maven.com/parlance-labs/o/540bd8 Digital asset

Hamel Husain

ML Engineer with 20 years of experience

Shreya Shankar

ML Systems & Applied AI Evals Researcher

12 pages with visual bite-sized takeaways of the most important learnings from our course.

elvis (@omarsar0): If you use LLM-as-judge, this one is worth reading.

(bookmark it)

It’s actually one of the most effective ways to use LLM-as-a-Judge for evals.

Holistic judge scores hide both their reasoning and their ceiling effects.

BINEVAL decomposes each evaluation criterion into atomic

Similar Articles

@omarsar0: If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use L…

X AI KOLs Following

BinEval is a new framework that decomposes LLM evaluation criteria into atomic binary questions, improving interpretability and enabling targeted prompt optimization, achieving strong results on factual consistency benchmarks.

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv cs.CL

This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations, finding that pairwise preferences flip 13.6% of the time on average, with significant first-position bias in GPT-4o-mini, and recommends multi-trial aggregation and position randomization.

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv cs.CL

This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Hugging Face Daily Papers

This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

arXiv cs.CL

This paper explores which agreement statistics for LLM judge validation are redundant when criteria are binary, and provides a checklist for proper reporting including abstention handling.
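For readers who have not validated a binary judge against human labels before, the sketch below computes two common agreement statistics, raw agreement and Cohen's kappa, and reports abstentions separately. It is an illustration only, not the paper's checklist. The function name `agreement_report`, the use of `None` to mark an abstention, and the choice to exclude abstentions from both statistics are all assumptions made here.

```python
from typing import Optional, Sequence


def agreement_report(
    human: Sequence[bool], judge: Sequence[Optional[bool]]
) -> dict:
    """Compare binary judge verdicts with human labels.

    A judge value of None is treated as an abstention (an assumption of this
    sketch): it is counted separately and left out of both statistics.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")

    pairs = [(h, j) for h, j in zip(human, judge) if j is not None]
    abstentions = len(human) - len(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("judge abstained on every item")

    # Observed agreement: share of items where both labels match.
    observed = sum(h == j for h, j in pairs) / n

    # Chance agreement from each rater's marginal rate of True labels.
    p_human = sum(h for h, _ in pairs) / n
    p_judge = sum(j for _, j in pairs) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    # Kappa is undefined when chance agreement is already 1.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")

    return {
        "n_scored": n,
        "abstentions": abstentions,
        "agreement": observed,
        "cohen_kappa": kappa,
    }


human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, False, True, None]
report = agreement_report(human_labels, judge_labels)
print(report)
```

With these toy labels the judge abstains once, agrees with the human on 4 of the 5 scored items, and gets a kappa of roughly 0.62. Reporting the abstention count next to the statistics makes it visible how many items were left out.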
