llm-as-judge

#llm-as-judge

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

arXiv cs.CL · 5d ago

This paper studies the relationship between token-level log-probability distributions, LLM-as-judge rubric scores, and final task accuracy in multi-agent debate systems. It finds a consistent four-phase confidence trajectory and a role asymmetry between the Constructor and Auditor agents.

POLARIS: Guiding Small Models to Write Long Stories

arXiv cs.CL · 2026-06-04

POLARIS is a training recipe using GRPO with LLM-as-judge rewards and human-reference injection to improve long-form story generation in small models. Applied to Qwen3.5-9B, the resulting POLARIS-9B model matches Qwen3.5-27B performance on creative writing benchmarks while better adhering to length instructions.

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

arXiv cs.AI · 2026-06-04

AICompanionBench introduces the first publicly available benchmark dataset of 2,123 real-world AI companion conversations annotated across nine safety risk categories, used to evaluate 20 LLMs as safety judges. Results show that strong models handle explicit harmful content well but struggle with nuanced risks such as manipulation and produce false positives on benign conversations.

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv cs.CL · 2026-06-03

This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.

Short-form Text Rewriting with Phi Silica

arXiv cs.CL · 2026-06-02

This paper presents an empirical study adapting the small language model Phi Silica for short-form text rewriting through dataset curation, prompt distillation, and parameter-efficient fine-tuning, showing that targeted adaptation significantly improves semantic fidelity and reduces hallucinations.

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

arXiv cs.AI · 2026-06-01

GLIDE is an open-source Python library that unifies state-of-the-art Prediction-Powered Inference methods for debiased evaluation of generative AI and agentic systems, enabling annotation savings with valid uncertainty estimates.

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Hugging Face Daily Papers · 2026-06-01

This paper identifies perceptual judgment bias in multimodal LLM judges, which over-reward fluent but visually wrong responses. To mitigate this bias and improve perception-grounded evaluation, it proposes a dataset, PPJD, and a model, Perception-Judge, trained using GRPO with a batch-ranking reward.

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

arXiv cs.AI · 2026-05-29

This paper proposes a Deep Research pipeline that improves literature search recall by an order of magnitude and argues that human citation lists are not reliable ground truth for evaluation.

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

arXiv cs.AI · 2026-05-27

This paper identifies weighting noise in LLM judges for multi-stakeholder tasks and proposes DecompR, a method that decouples utility estimation from aggregation using counterfactually calibrated weights.

@Voxyz_ai: can't wait for this gbrain feature. here's the loop: agent attempts a task using a skill ↓ gbrain eval or LLM-as-judge …

X AI KOLs Following · 2026-05-26

Voxyz highlights an upcoming GBrain feature that lets agents iteratively improve skills using LLM-as-judge evaluation and an overnight optimization cycle.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv cs.LG · 2026-05-19

This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings. It provides guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Hugging Face Daily Papers · 2026-05-17

This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.

Notes from evaluating a customer support chat agent system: heuristic evaluators give false signal, retrieval bugs masquerade as LLM failures, and the cost/quality Pareto frontier is rarely where you think [D]

Reddit r/MachineLearning · 2026-05-15

Practical findings from auditing a production customer support RAG system reveal that heuristic evaluators give false signal, retrieval bugs often masquerade as LLM failures, and the Pareto frontier for cost and quality is often not where expected. Sweeping models showed that replacing the incumbent (Gemini Flash Lite Preview) with Gemma 4 26B achieved a 19% quality improvement at 79% lower cost.

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Hugging Face Daily Papers · 2026-05-11

This paper investigates central tendency bias in multimodal LLMs used for clinical ordinal scoring of the Clock Drawing Test, finding that LLMs compress predictions toward the middle of the scale, disproportionately affecting critical extremes. The study extends the LLM-as-judge bias literature to clinical assessment, highlighting the need for calibration-aware evaluation before deployment.

Measuring information density in web pages from an LLM agent's perspective [R]

Reddit r/MachineLearning · 2026-05-08

This paper presents empirical measurements of information density in web pages from the perspective of LLM agents, using a curated benchmark of 100 URLs across five categories. It finds that structural extraction reduces token count by an average of 71.5% while preserving answer quality, and reveals an undocumented compression layer in Claude Code.

Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

arXiv cs.CL · 2026-05-08

This study analyzes how modifications to evaluation rubrics, such as shifting from holistic to analytic criteria, impact the agreement between human raters and AI autoraters. The findings suggest that providing examples and reducing bias improves agreement, while higher complexity tends to decrease it.

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

arXiv cs.CL · 2026-04-22

Researchers from PNNL and Washington University introduce a systematic framework to test how five LLMs detect subtle semantic changes in documents, revealing positional bias, context coherence effects, and model-specific scoring fingerprints.
