Tag
This empirical survey extends prior work on the bias-reliability tradeoff in LLM evaluation by measuring evaluator coupling, strategy diversity, and small-sample reliability across 11 conditions. It confirms that low evaluator influence leads to high measurement noise, while strong coupling reduces both strategy diversity and measurement noise.