llm-judges

Tag

#llm-judges

@lateinteraction: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

X AI KOLs Following · 20h ago

GEPA-optimized LLM judges from dspy are used for data filtering in Microsoft's MAI-Thinking-1 model pre-training pipeline.


PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

arXiv cs.AI · 2d ago

Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.


Agent Judge: Solving Long-Context Evals for Production Agents (10 minute read)

TLDR AI · 6d ago

Agent Judge is an agentic evaluation harness that overcomes the limitations of simple LLM judges for long-horizon agents by handling long trajectories, verifying stateful actions against source-of-truth systems, and adapting to changing behavior.


Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

arXiv cs.CL · 2026-05-26

This paper introduces a causal framework to quantify rationalization bias in LLM judges, where verdicts and explanations are influenced by non-evidential cues rather than underlying texts. It proposes cue interventions, anchoring metrics, and the Proof-Before-Preference mitigation protocol, demonstrating improved cue invariance.


Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv cs.CL · 2026-05-20

This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.
