Tag
This paper examines the gap between reliability and construct validity when using LLMs as coding instruments for theoretical constructs, and proposes grain calibration as a method to decompose constructs into clause-level components for more valid measurement.
This paper argues against the 'retire-and-replace' approach to saturated benchmarks, using CORE-Bench as a case study to demonstrate that measuring agent performance along dimensions such as construct validity, efficiency, reliability, and human-agent collaboration yields meaningful insights even after accuracy plateaus.
This Systematization of Knowledge paper proposes a Multi-Trait Multi-Method (MTMM) geometric framework for evaluating Large Language Models, which places disparate metrics in a shared latent coordinate space to address construct validity issues in current benchmarks.
This paper critiques the 'Proxy Presumption' in NLP, in which geometric properties of embeddings are incorrectly equated with social constructs. It introduces the Construct Validity Protocol and Counterfactual Neutralization as methods for rigorously validating social measures derived from semantic embeddings.