@HamelHusain: Yes! binary judges are far more practical for most people, because likert scales (or scores) have too many footguns All…


Summary

Hamel Husain shares flashcards and insights from an AI evaluation course, advocating for binary judges over Likert scales for practical LLM evaluation.

Yes! Binary judges are far more practical for most people, because Likert scales (or scores) have too many footguns. All the flashcards are here (inspired by @chrisalbon's flashcards): https://t.co/qfB4WJgX5n https://t.co/OvSdVi5rbB

Cached at: 06/29/26, 04:23 AM


AI Eval Flashcards | Hamel Husain & Shreya Shankar on Maven

Source: https://maven.com/parlance-labs/o/540bd8 (digital asset)

Hamel Husain

ML Engineer with 20 years of experience

Shreya Shankar

ML Systems & Applied AI Evals Researcher


12 pages with visual bite-sized takeaways of the most important learnings from our course.

elvis (@omarsar0): If you use LLM-as-judge, this one is worth reading.

(bookmark it)

It’s actually one of the most effective ways to use LLM-as-a-Judge for evals.

Holistic judge scores hide both their reasoning and their ceiling effects.

BINEVAL decomposes each evaluation criterion into atomic

Similar Articles

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv cs.CL

This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations, finding that pairwise preferences flip 13.6% of the time on average, with significant first-position bias in GPT-4o-mini, and recommends multi-trial aggregation and position randomization.

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Hugging Face Daily Papers

This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.
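
The first paper above recommends multi-trial aggregation and position randomization for pairwise judges. A minimal sketch of that idea follows, and it is not the paper's implementation. It assumes a hypothetical `judge` callable that receives two responses in presentation order and returns "first" or "second"; in practice that callable would wrap an LLM call. The function name `aggregate_preference` and its parameters are illustrative only.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical judge interface (an assumption, not a real library API):
# takes the responses in presentation order and returns "first" or "second".
Judge = Callable[[str, str], str]


def aggregate_preference(
    judge: Judge, a: str, b: str, trials: int = 5, seed: int = 0
) -> str:
    """Return "a", "b", or "tie" by majority vote over several judge calls.

    Each trial randomly decides which response is shown first, so a judge
    that favors a position cannot consistently favor the same response.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(trials):
        swapped = rng.random() < 0.5
        first, second = (b, a) if swapped else (a, b)
        verdict = judge(first, second)
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        # Map the positional verdict back to the underlying response.
        if verdict == "first":
            winner = "b" if swapped else "a"
        else:
            winner = "a" if swapped else "b"
        votes[winner] += 1
    if votes["a"] == votes["b"]:
        return "tie"
    return "a" if votes["a"] > votes["b"] else "b"
```

A judge that decides on content alone returns the same winner regardless of ordering. A judge that always picks the first position sees its votes split across the two responses, so the bias shows up as a near-tie instead of a confident preference.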