llm-judge

#llm-judge

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

arXiv cs.CL · 8h ago

This paper introduces 'second-order bias', the bias LLMs exhibit when judging biased content, and proposes a reasoning task grounded in epistemic entitlement to evaluate it. Experiments show that the task evades safety guardrails and reveals systematic demographic biases in LLM judges.

#llm-judge

things i wish i knew before evaluating AI agents in production

Reddit r/AI_Agents · 15h ago

Personal lessons on evaluating AI agents in production, including mapping symptoms to layers, using trajectory evaluation, calibrating LLM judges, converting failures to test cases, and performing adversarial testing.

#llm-judge

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv cs.CL · yesterday

This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.

#llm-judge

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv cs.AI · 2026-06-10

RealMath-Eval is a benchmark of 224 real-world high school math exam responses that reveals a significant 'Evaluation Gap': state-of-the-art LLM judges perform poorly on authentic human reasoning (MSE ~2.96) compared to synthetic LLM-generated solutions (MSE ~1.17), due to higher diversity and surprisal in human error patterns.

#llm-judge

Built an OSS spec-driven AI development tool that runs multiple agents in parallel on the same feature with an LLM-as-judge that picks the winner

Reddit r/AI_Agents · 2026-05-25

Aigon is an open-source tool that runs multiple AI coding agents in parallel on the same feature specified in a markdown spec and uses an LLM judge to select the best implementation, with a visual dashboard and optional scheduling.

#llm-judge

Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.

Reddit r/LocalLLaMA · 2026-05-15

A detailed evaluation of a RAG customer support chatbot finds that retrieval issues often masquerade as LLM problems, that heuristic evaluators are misleading, that deduplication improves quality, that stricter grounding trades helpfulness for accuracy, and that sweeping across models can dramatically reduce cost while improving performance.

#llm-judge

CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production

Hacker News Top · 2026-04-21

Brex open-sources CrabTrap, an LLM-as-a-judge HTTP proxy that filters and secures AI agent traffic before it reaches production services.

#llm-judge

@pauliusztin_: Every day, 100+ people ask me, "How can I learn AI evals?" I copy-paste these 11 links (every time): 1. AI evals & obse…

X AI KOLs Timeline · 2026-04-21

A curated list of 11 links shared daily to help people learn AI evaluation techniques, covering evals, observability, LLM-as-judge, and agent evaluation.
