Tag
This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.