llm-as-a-judge

#llm-as-a-judge

Judge Circuits

arXiv cs.CL · 22h ago

This paper investigates the internal mechanisms of LLM-as-a-judge. It finds a Latent Evaluator sub-graph in the mid-to-late MLP layers, shared across models, that handles abstract judging, while format-specific terminal branches map the judgment to output tokens. This split reveals the cause of format-induced inconsistency.

#llm-as-a-judge

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

arXiv cs.CL · 6d ago

This paper introduces Magis-Bench, a benchmark for evaluating large language models on magistrate-level legal tasks such as judicial reasoning and sentence drafting, using data from Brazilian judicial exams.

#llm-as-a-judge

@ArizePhoenix: Who judges the evaluators? When you use LLM-as-a-judge, you’re trusting a model to decide whether your agent, workflow,…

X AI KOLs Following ↗ · 2026-05-07

This post discusses the challenges of debugging and evaluating LLM judges, and describes how Arize Phoenix traces evaluator runs via OpenTelemetry so that their decision logic, costs, and potential biases can be inspected.

