evaluation-reliability

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv cs.CL · yesterday

This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations. It finds that pairwise preferences flip 13.6% of the time on average and that GPT-4o-mini shows a significant first-position bias, and it recommends multi-trial aggregation and position randomization.
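
The two recommendations, multi-trial aggregation and position randomization, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `judge` callable, its `"first"`/`"second"` return convention, the function name `aggregate_preference`, the default of five trials, and the majority-vote rule are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Optional, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# it receives the prompt and two answers in presentation order and
# returns "first" or "second" for the answer it prefers.
Judge = Callable[[str, str, str], str]


def aggregate_preference(
    judge: Judge,
    prompt: str,
    answer_a: str,
    answer_b: str,
    trials: int = 5,
    seed: Optional[int] = None,
) -> Tuple[str, Dict[str, int]]:
    """Query the judge several times with randomized answer order
    and return the majority winner ("A", "B", or "tie") plus vote counts."""
    rng = random.Random(seed)
    votes = {"A": 0, "B": 0}
    for _ in range(trials):
        # Position randomization: show B first in roughly half the trials.
        swap = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
        picked_first = judge(prompt, first, second) == "first"
        # Map the positional verdict back to the underlying answer.
        if swap:
            winner = "B" if picked_first else "A"
        else:
            winner = "A" if picked_first else "B"
        votes[winner] += 1
    # Multi-trial aggregation: simple majority vote over all trials.
    if votes["A"] == votes["B"]:
        return "tie", votes
    return ("A" if votes["A"] > votes["B"] else "B"), votes
```

With a judge that responds only to content, every trial agrees regardless of order. With a judge that always picks the first position, the votes split according to how often each answer was shown first, so the bias shows up as disagreement across trials instead of as a confident verdict.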
