Tag
GEPA-optimized LLM judges from DSPy are used for data filtering in the pre-training pipeline of Microsoft's MAI-Thinking-1 model.
Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
Agent Judge is an agentic evaluation harness that overcomes the limitations of simple LLM judges for long-horizon agents by handling long trajectories, verifying stateful actions against source-of-truth systems, and adapting to changing behavior.
This paper introduces a causal framework for quantifying rationalization bias in LLM judges, where verdicts and explanations are driven by non-evidential cues rather than by the underlying texts. It proposes cue interventions, anchoring metrics, and the Proof-Before-Preference mitigation protocol, and demonstrates improved cue invariance.
This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.
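As a rough illustration of the cue-intervention idea in the rationalization-bias entry above, the sketch below measures how often a judge's verdict flips when only a non-evidential cue is added to the input. It is not the paper's metric or protocol: `cue_flip_rate`, `toy_judge`, and `add_authority_cue` are hypothetical names, and the toy judge is a keyword rule standing in for an LLM call.

```python
from typing import Callable, Sequence


def cue_flip_rate(
    judge: Callable[[str], str],
    items: Sequence[str],
    add_cue: Callable[[str], str],
) -> float:
    """Fraction of items whose verdict changes when only the cue is added.

    A perfectly cue-invariant judge scores 0.0. This is an illustrative
    statistic, not the anchoring metric defined in the paper.
    """
    if not items:
        raise ValueError("items must be non-empty")
    flips = sum(judge(text) != judge(add_cue(text)) for text in items)
    return flips / len(items)


def toy_judge(text: str) -> str:
    # Hypothetical stand-in for an LLM judge: it is swayed by an authority tag.
    return "accept" if "[expert]" in text or "correct" in text else "reject"


def add_authority_cue(text: str) -> str:
    # Non-evidential cue: the underlying content is left unchanged.
    return "[expert] " + text


items = ["the answer is correct", "the answer is wrong", "unsupported claim"]
rate = cue_flip_rate(toy_judge, items, add_authority_cue)
print(f"flip rate under authority cue: {rate:.2f}")
```

Here the toy judge flips on two of the three items, so its verdicts depend on the cue rather than on the text. A real audit would replace `toy_judge` with calls to the judge model and average over many cue types.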