llm-judges

LLM Judges Can Be Too Generous When There Is No Reference Answer

arXiv cs.CL · 5d ago

This paper shows that LLM judges tend to over-credit incorrect answers when no reference answer is provided, and adding a reference can flip verdicts by up to 85%, aligning more with human judgments. The authors propose calibration steps for using LLM judges in reference-free settings.

@HamelHusain: New Blog Post: Do Automated Evals Work? There has been a rise of tools that look through your traces with AI and identi…

X AI KOLs Timeline · 6d ago

A blog post from Parlance Labs tests automated AI evaluation tools (Braintrust Loop, Arize Alyx, LangSmith Engine) on real production data. It finds that they catch 87% of the issues humans flag but miss domain-specific failures and add noise, and it recommends using them iteratively with a human in the loop.

@h100envy: Prime Intellect engineers explained how they train reasoning models over the open internet in 30 minutes - better than …

X AI KOLs Timeline · 2026-07-12

In a 30-minute explanation, Prime Intellect engineers described how they train reasoning models with distributed RL over the open internet, using Prime-RL, LLM judges, and multi-cloud GPUs, an approach that lets open models compete with closed labs without owning data centers.

@omarsar0: So much alpha in tuning/building LLM verifiers and judges. I use them on top of my harness, and it has unlocked agentic…

X AI KOLs Timeline · 2026-07-02

Omar highlights the growing value of building LLM verifiers and judges for agentic coding, while Mira Murati shares that Bridgewater partnered with TinkerAPI to fine-tune a model for financial analysis.

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv cs.CL · 2026-06-25

This paper systematically compares fine-tuned encoder classifiers (ModernBERT family) against decoder-based safety judges for LLM adversarial evaluation, finding that encoders can offer a cost- and latency-efficient alternative without significant performance loss.

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

arXiv cs.AI · 2026-06-16

This paper introduces Metric Match, a method for selecting a subset of samples for human annotation to estimate LLM judge reliability more efficiently, reducing annotation costs by 32.5% and achieving a win-rate of 0.838 against random selection.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

arXiv cs.AI · 2026-06-09

This paper investigates the ability of LLMs-as-judges for safety to adapt to contextual information and varying safety definitions, finding that they are largely rigid and fail to adjust when the context contradicts their internal priors.

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

arXiv cs.AI · 2026-06-04

This paper empirically examines when to interrupt autonomous AI agents during software execution, finding that affective-state thresholds saturate quickly, LLM judges achieve low F1 scores (0.17–0.40) at high cost, and human annotators themselves show near-chance agreement on intervention timing, making the construct unreliable as an optimization target.

@lateinteraction: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

X AI KOLs Following · 2026-06-03

GEPA-optimized LLM judges from dspy are used for data filtering in Microsoft's MAI-Thinking-1 model pre-training pipeline.

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

arXiv cs.AI · 2026-06-01

This paper introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.

Agent Judge: Solving Long-Context Evals for Production Agents (10 minute read)

TLDR AI · 2026-05-29

Agent Judge is an agentic evaluation harness that overcomes the limitations of simple LLM judges for long-horizon agents by handling long trajectories, verifying stateful actions against source-of-truth systems, and adapting to changing behavior.

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

arXiv cs.CL · 2026-05-26

This paper introduces a causal framework to quantify rationalization bias in LLM judges, where verdicts and explanations are influenced by non-evidential cues rather than underlying texts. It proposes cue interventions, anchoring metrics, and the Proof-Before-Preference mitigation protocol, demonstrating improved cue invariance.

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

Hugging Face Daily Papers · 2026-05-25

This paper identifies two failure modes in multi-objective prompt optimization for LLM judges using textual gradients: gradient dilution during optimization and instruction interference during inference, showing that joint gradient processing loses criterion-specific information.

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv cs.CL · 2026-05-20

This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.
