llm-judge

#llm-judge

Position: Evaluation Scores Are Perishable Knowledge Claims

arXiv cs.AI · 2d ago

This position paper argues that language model evaluation scores should be treated as perishable knowledge claims, not ground truth, and proposes explicit metadata such as formality tier, scope declaration, and expiration date to counter 'trust inflation' caused by averaging weak and strong signals.
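
The metadata fields named in this summary could be attached to a score roughly as follows. This is a minimal sketch, not the paper's schema: the class `EvalClaim`, its field names, the example tier labels, and the helper `usable_claims` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalClaim:
    """An evaluation score carried together with its metadata (hypothetical schema)."""
    metric: str
    score: float
    formality_tier: str  # assumed labels, e.g. "informal" or "audited"
    scope: str           # what the score is claimed to cover
    expires: date        # after this date the claim should be re-checked

    def is_expired(self, today: date) -> bool:
        # A claim is still usable on its expiration date itself.
        return today > self.expires


def usable_claims(claims, today):
    """Keep only claims that have not expired as of `today`."""
    return [c for c in claims if not c.is_expired(today)]
```

Filtering on `expires` before aggregating is one way to avoid mixing stale and fresh signals; grouping by `formality_tier` instead of averaging across tiers would follow the same pattern.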

@LangChain: How @Similarweb evaluates a Deep Research agent when there's no single right answer: Deterministic checks for tool call…

X AI KOLs Timeline · 3d ago

This article explains how Similarweb evaluates long-form agent research reports using LangSmith, combining deterministic checks for tool calls and LLM-as-judge scoring for quality, with a focus on making regressions inspectable and enabling A/B comparisons.

@josephdecker: The winner of my 16-model eval fabricated 5 times in the audit. Second place, a tenth of a point back: zero fabrication…

X AI KOLs Timeline · 2026-07-24

Joseph Decker evaluates 16 AI models on truthfulness for his product Condensr and finds that the leaderboard winner fabricated content five times in an audit, so he ships the second-place model, which had zero fabrications. The post details the evaluation process, a bug in the LLM judge that penalized accurate summaries because of truncated transcripts, and the importance of custom evals over generic benchmarks.

A Classifier That Teaches Itself: Self-Improving, Frozen-gate Training (SIFT) for Dynamic Document Classification

arXiv cs.CL · 2026-07-22

SIFT introduces a self-improving document classifier that uses a cheap SPLADE-LightGBM pipeline and an LLM judge to continuously teach itself, while a frozen-gate safety mechanism prevents silent regression, enabling autonomous retraining without human labeling overhead.

Evaluating medical AI under missing information: same-provider judges and human raters change apparent safety

arXiv cs.AI · 2026-07-22

This paper extends missing-information stress-testing to open-ended medical conversation, finding that LLM judge choice materially changes apparent safety and that LLM judges are more permissive than clinicians.

EduPanel: A Three-Agent LLM Judge for Teaching Videos -- Reliability, Complementarity, and Human Trust Calibration

Hugging Face Daily Papers · 2026-07-20

EduPanel is a rubric-grounded, learner-conditioned LLM judge that uses three specialized agents to evaluate teaching video quality, achieving reliability comparable to human experts and improving scoring accuracy while maintaining detectability of unreliable outputs.

Articulate Intuition or Genuine Analysis? Benchmarking Epistemic Reliability in LLM-as-a-Judge Peer Reviews

arXiv cs.CL · 2026-07-14

This paper introduces Kahneman4Review, a benchmark of 3,563 peer reviews rated along nine theoretically motivated textual dimensions, eight bias diagnostics, and a continuous reasoning-quality score, to evaluate whether LLM judges can distinguish analytical form from genuine epistemic quality in peer review.

CAFE: A Compound-AI Factorial Evaluation Framework

arXiv cs.CL · 2026-07-14

CAFE is an open-source platform that applies design-of-experiments principles to evaluate compound AI systems, attributing answer quality variance to components and their interactions using factorial designs and mixed-effects models.

Do You Need a Frontier Model as a Citation Verifier? Benchmarking Rubric LLMs for Deep-Research Source Attribution

arXiv cs.CL · 2026-07-10

This paper benchmarks 8 LLM judges for citation quality in deep-research systems, finding that cheaper models remain competitive with frontier models on source relevance and factual support but differ in directional bias, which matters for RL training.

What Predicts Correctness in Text-to-SQL? A Selective-Prediction Study

arXiv cs.LG · 2026-07-09

This paper studies which signals best predict correctness in text-to-SQL for selective prediction. It finds that verification-based signals from LLM judges outperform black-box statistical signals like self-consistency, and that a two-provider ensemble achieves 0.82 AUROC with well-calibrated probabilities.

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

arXiv cs.CL · 2026-07-01

This paper introduces TheraJudge and TheraAgent, a framework that uses multi-dimensional human-aligned evaluation to improve therapeutic response generation in LLMs, showing significant gains in quality and safety.

@akshay_pachaar: If you use LLM-as-judge, this one is for you. (bookmark it) Most teams validate their agent's outputs by calling a fron…

X AI KOLs Following · 2026-06-30

The post details an approach to training a small LLM judge that evaluates agent outputs in place of costly frontier models, with a Claude Code plugin for deployment.

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Hugging Face Daily Papers · 2026-06-29

This paper introduces approach-level diversity for LLM math reasoning, showing that surface-level diversity metrics are unreliable proxies and that directly optimizing for approach diversity remains an open problem.

@HamelHusain: Yes! binary judges are far more practical for most people, because likert scales (or scores) have too many footguns All…

X AI KOLs Timeline · 2026-06-28

Hamel Husain shares flashcards and insights from an AI evaluation course, advocating for binary judges over Likert scales for practical LLM evaluation.
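
One practical upside of a binary judge is that checking it against human labels takes only a few lines. The following is a minimal sketch under that assumption, not material from the course: the function name `judge_agreement` and the pass/fail encoding as booleans are hypothetical.

```python
import math


def judge_agreement(human, judge):
    """Return (true-positive rate, true-negative rate) of a binary judge
    measured against human pass/fail labels of the same length."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have equal length")
    pairs = [(bool(h), bool(j)) for h, j in zip(human, judge)]
    positives = sum(1 for h, _ in pairs if h)
    negatives = len(pairs) - positives
    true_pos = sum(1 for h, j in pairs if h and j)
    true_neg = sum(1 for h, j in pairs if not h and not j)
    # A rate is undefined (NaN) when the humans never used that label.
    tpr = true_pos / positives if positives else math.nan
    tnr = true_neg / negatives if negatives else math.nan
    return tpr, tnr
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that passes everything from looking good on a dataset where most items pass.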

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

arXiv cs.CL · 2026-06-17

This paper introduces 'second-order bias', the bias LLMs exhibit when judging biased content, and proposes a reasoning task grounded in epistemic entitlement to evaluate it. Experiments show that the task evades safety guardrails and reveals systematic demographic biases in LLM judges.

things i wish i knew before evaluating AI agents in production

Reddit r/AI_Agents · 2026-06-16

Personal lessons on evaluating AI agents in production, including mapping symptoms to layers, using trajectory evaluation, calibrating LLM judges, converting failures to test cases, and performing adversarial testing.

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv cs.CL · 2026-06-16

This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.
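
A position-sensitivity check of this general kind can be sketched as an order-swap test. This is one plausible reading and not the paper's definition of positional false preference: the function `position_flip_rate` and the judge interface (a callable that returns `"A"` for the first-shown response and `"B"` for the second) are assumptions made for illustration.

```python
def position_flip_rate(pairs, judge):
    """Fraction of response pairs whose winner changes when the
    presentation order is swapped. `judge(first, second)` must return
    "A" (prefers the first-shown response) or "B" (the second)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    flips = 0
    for a, b in pairs:
        # Index of the winning response within (a, b), shown in original order.
        winner_original = 0 if judge(a, b) == "A" else 1
        # Same pair shown swapped: "A" now refers to b, "B" refers to a.
        winner_swapped = 1 if judge(b, a) == "A" else 0
        if winner_original != winner_swapped:
            flips += 1
    return flips / len(pairs)
```

A judge that always prefers whichever response is shown first scores 1.0 here, while a judge whose choice depends only on content scores 0.0.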

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv cs.AI · 2026-06-10

RealMath-Eval is a benchmark of 224 real-world high school math exam responses that reveals a significant 'Evaluation Gap': state-of-the-art LLM judges perform poorly on authentic human reasoning (MSE ~2.96) compared to synthetic LLM-generated solutions (MSE ~1.17), due to higher diversity and surprisal in human error patterns.

Built an OSS spec-driven AI development tool that runs multiple agents in parallel on the same feature with an LLM-as-judge that picks the winner

Reddit r/AI_Agents · 2026-05-25

Aigon is an open-source tool that runs multiple AI coding agents in parallel on the same feature specified in a markdown spec and uses an LLM judge to select the best implementation, with a visual dashboard and optional scheduling.

Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.

Reddit r/LocalLLaMA · 2026-05-15

A detailed evaluation of a RAG customer support chatbot reveals that retrieval issues often masquerade as LLM problems, heuristic evaluators are misleading, deduplication improves quality, stricter grounding trades helpfulness for accuracy, and model sweeping can dramatically reduce cost while improving performance.
