construct-validity

#construct-validity

Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs

arXiv cs.CL · 5d ago

This paper examines the gap between reliability and construct validity when using LLMs as coding instruments for theoretical constructs, and proposes grain calibration as a method to decompose constructs into clause-level components for more valid measurement.

#construct-validity

Life After Benchmark Saturation: A Case Study of CORE-Bench

arXiv cs.AI · 2026-06-26

This paper argues against the 'retire-and-replace' approach to saturated benchmarks, using CORE-Bench as a case study to demonstrate that measuring agent performance along dimensions such as construct validity, efficiency, reliability, and human-agent collaboration yields meaningful insights even after accuracy plateaus.

#construct-validity

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

arXiv cs.CL · 2026-05-12

This Systematization of Knowledge paper proposes a Multi-Trait Multi-Method (MTMM) geometric framework for evaluating Large Language Models, mapping disparate metrics into a shared latent coordinate space to address construct validity issues in current benchmarks.

#construct-validity

The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

arXiv cs.CL · 2026-05-11

This paper critiques the 'Proxy Presumption' in NLP, where geometric embedding properties are incorrectly equated with social constructs. It introduces the Construct Validity Protocol and Counterfactual Neutralization methods to ensure rigorous validation of social measures derived from semantic embeddings.
