@ArizePhoenix: Who judges the evaluators? When you use LLM-as-a-judge, you’re trusting a model to decide whether your agent, workflow,…
Summary
The article discusses the challenges of debugging and evaluating LLM judges, and describes how Arize Phoenix traces evaluator runs via OpenTelemetry so that their decision logic, costs, and potential biases can be inspected.
Similar Articles
@ArizePhoenix: One of the oldest lessons in ML is still one of the most useful for working with LLM apps: Don’t evaluate on the same d…
This article discusses best practices for LLM application development with Arize Phoenix, highlighting the importance of train/validation/test splits for honest evaluation and for tracking regressions.
Judge Circuits
This paper investigates the internal mechanisms of LLM-as-a-judge. It finds a shared Latent Evaluator sub-graph in mid-to-late MLPs across models that handles abstract judging, while format-specific terminal branches map the judgment to output tokens, which reveals the cause of format-induced inconsistency.
Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
This paper tests the assumption that LLMs judge better than they generate in in-context QA. It finds that generation accuracy exceeds self-evaluation accuracy on most benchmarks and that evaluation attends less to the context, challenging core assumptions in self-evaluation pipelines.
@omarsar0: LLM-as-a-Judge explained in ~10 mins. Knowing how to build AI verifiers and judges is one of the most important emergin…
A quick introduction to the LLM-as-a-Judge concept, explaining how to build AI verifiers and judges, and pointing to resources to learn more.
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.