The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

arXiv cs.CL Papers

Summary

This paper critiques the 'Proxy Presumption' in NLP, where geometric embedding properties are incorrectly equated with social constructs. It introduces the Construct Validity Protocol and Counterfactual Neutralization methods to ensure rigorous validation of social measures derived from semantic embeddings.

arXiv:2605.07409v1 Announce Type: new Abstract: Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the ''Proxy Presumption,'' or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite -- including tests for discriminant, incremental, and predictive validity -- this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.


# The Proxy Presumption: From Semantic Embeddings to Valid Social Measures
Source: [https://arxiv.org/html/2605.07409](https://arxiv.org/html/2605.07409)
###### Abstract

Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the “Proxy Presumption,” or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite (including tests for discriminant, incremental, and predictive validity), this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.

## 1 Introduction

The core ambition of contemporary Natural Language Processing (NLP) has expanded beyond predicting the next token; we are increasingly tasked with measuring the social world. Recent literature has pioneered the quantification of abstract social constructs, proposing computational metrics for concepts such as novelty, creativity, and bias (Merrill et al., [2024](https://arxiv.org/html/2605.07409#bib.bib51); Lee et al., [2024](https://arxiv.org/html/2605.07409#bib.bib45); Bang et al., [2024](https://arxiv.org/html/2605.07409#bib.bib7)). This transition marks a pivotal moment: NLP is evolving from an engineering discipline into a primary measurement instrument for Computational Social Science (CSS).

However, this ambition introduces a methodological challenge: the gap between theoretical constructs (what we intend to measure) and computational proxies (what we implement). In the absence of established ground truth for latent social variables, the field faces a risk we term the Proxy Presumption: a convenient geometric property (most commonly cosine similarity in embedding space) is renamed as a theoretical construct without sufficient validation (Caliskan et al., [2017](https://arxiv.org/html/2605.07409#bib.bib12); May et al., [2019](https://arxiv.org/html/2605.07409#bib.bib49); Bolukbasi et al., [2016](https://arxiv.org/html/2605.07409#bib.bib11); Garg et al., [2018](https://arxiv.org/html/2605.07409#bib.bib28); Kozlowski et al., [2019](https://arxiv.org/html/2605.07409#bib.bib43)). We emphasize that this is a recurring pattern across a substantial and influential subset of NLP/ML measurement work rather than an isolated mistake, while recognizing meaningful variation in how rigorously different papers validate their proxies. While vector distance captures semantic divergence, equating it directly with “innovation” or “creativity” assumes a strong isomorphism between embedding space geometry and social reality that may not hold.

While measurement validity is a general issue in machine learning, the proxy presumption is especially acute in NLP because text is the primary unstructured medium for computational social science, and major nuisance dimensions ($Z$) in text are inherently linguistic (register, dialect, pragmatics, style). Moreover, many interventions that can reduce confounding are text-native, including counterfactual rewriting and language-aware debiasing methods (Elazar and Goldberg, [2018](https://arxiv.org/html/2605.07409#bib.bib22); Ravfogel et al., [2020](https://arxiv.org/html/2605.07409#bib.bib57); Blodgett et al., [2020](https://arxiv.org/html/2605.07409#bib.bib10); Sap et al., [2019](https://arxiv.org/html/2605.07409#bib.bib60)).

This paper argues that such “measurement by renaming” can be formalized as a non-identification problem: unsupervised embeddings compress multiple generative factors into a single representation, entangling the target construct ($C$) with nuisance attributes ($Z$) such as topic, author style, register, or length (Locatello et al., [2019](https://arxiv.org/html/2605.07409#bib.bib48); Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)). Without explicit disentanglement or design-based controls, standard metrics may primarily capture method variance rather than the intended social phenomenon (Zhou et al., [2022](https://arxiv.org/html/2605.07409#bib.bib68)). For example, a high cosine distance may reflect a genuine conceptual break, or it may simply reflect a vocabulary shift. Even with perfect disentanglement, current practice still suffers from a mis-specification issue: the function chosen to represent the construct (such as cosine similarity) may not be the one that best captures the functional relationship between $C$ and the document once the influence of the confounding covariates $Z$ is excluded.

To address this challenge and support the maturation of CSS, we introduce the Construct Validity Protocol (CVP). Adapted from standards in psychometrics and causal representation learning (Bengio et al., [2013](https://arxiv.org/html/2605.07409#bib.bib9); Locatello et al., [2019](https://arxiv.org/html/2605.07409#bib.bib48); Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)), the CVP provides a roadmap for validating social variables in NLP. Our contributions are fourfold:

1. A non-identification result: We formalize why unsupervised metrics cannot reliably recover latent constructs without explicit assumptions or interventions (Locatello et al., [2019](https://arxiv.org/html/2605.07409#bib.bib48); Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)).
2. Methodological solutions: We propose Counterfactual Neutralization and connect it to existing debiasing/disentanglement tools such as adversarial removal and nullspace projection (Elazar and Goldberg, [2018](https://arxiv.org/html/2605.07409#bib.bib22); Ravfogel et al., [2020](https://arxiv.org/html/2605.07409#bib.bib57)).
3. A Validity Suite: We define a standardized suite of evaluations (stability/reliability, convergent validity, discriminant and incremental validity, known-groups checks, and criterion-related evidence) to test whether a proxy tracks the intended construct beyond nuisance dimensions.
4. A forensic literature review: We analyze recent publications (2021–2025) to document the reuse of nearly identical similarity-based instruments for distinct, sometimes contradictory, constructs.

#### Scope and positioning\.

This paper is a *position and synthesis* contribution: we propose CVP as a community-facing reporting protocol for measurement identification, not a final or exclusive standard. We present CVP as a starting point that should be iterated as evidence accumulates and as NLP-specific failure modes are better understood, analogous in spirit to evolving reporting guidelines in adjacent areas (e.g., Data Statements for NLP; Bender and Friedman, [2018](https://arxiv.org/html/2605.07409#bib.bib8)).

Our goal is not to discourage the measurement of complex social phenomena, but to provide tools that render such measurements scientifically defensible\. In Computational Social Science \(CSS\), these constructed variables are often critical inputs for downstream causal inference or predictive tasks\. However, if the variable construction itself lacks validity, no causal ML methods can rigorously establish downstream causality\.

## 2 Related Literature: The State of Social Measurement

Our work connects three research traditions that are often cited separately but rarely integrated in NLP: \(i\) measurement theory and construct validity from the social sciences, \(ii\) causal representation learning and identifiability, and \(iii\) recent NLP methodology debates about what models and annotations actually measure\.

### 2.1 Measurement Theory and Construct Validity.

Measurement theory distinguishes constructs (theoretical concepts) from measures (observable indicators). As codified in standard social science guidelines (DeVellis, [2016](https://arxiv.org/html/2605.07409#bib.bib19); Adcock and Collier, [2001](https://arxiv.org/html/2605.07409#bib.bib1)), validation is not a definitional act but an empirical one: researchers must demonstrate that a measure behaves as predicted within a “nomological network,” correlating with related concepts (convergent validity) while remaining distinct from nuisances (discriminant validity). This rigorous framework is increasingly relevant to NLP as the field targets complex social variables. A growing body of work now operationalizes constructs through explicit domains of observables, for example by decomposing persuasion into specific resistance strategies or mapping social norms to defined interactional patterns (Vijjini et al., [2024](https://arxiv.org/html/2605.07409#bib.bib67); Huang and Yang, [2023](https://arxiv.org/html/2605.07409#bib.bib37)). These studies exemplify the “construct-to-measure” pipeline we formalize, standing in contrast to scalar metrics that prioritize prediction without separating signal from confounding noise.

### 2.2 Validity in NLP: Annotation, Bias, and the Limits of Proxying.

A parallel line of work argues that many NLP variables inherit ambiguity from their measurement process. Human label variation is increasingly recognized not as annotator error, but as a meaningful signal of linguistic ambiguity and subjectivity (Pavlick and Kwiatkowski, [2019](https://arxiv.org/html/2605.07409#bib.bib54); Plank, [2022](https://arxiv.org/html/2605.07409#bib.bib55); Davani et al., [2022](https://arxiv.org/html/2605.07409#bib.bib16); Aroyo and Welty, [2015](https://arxiv.org/html/2605.07409#bib.bib3)). Furthermore, dataset curation and annotation protocols can embed structural artifacts and degenerate shortcuts into text, yielding variables that appear to measure a target construct while in fact tracking annotation or sampling regularities (Gururangan et al., [2018](https://arxiv.org/html/2605.07409#bib.bib32); Poliak et al., [2018](https://arxiv.org/html/2605.07409#bib.bib56); McCoy et al., [2019](https://arxiv.org/html/2605.07409#bib.bib50); Bender and Friedman, [2018](https://arxiv.org/html/2605.07409#bib.bib8); Geiger et al., [2020](https://arxiv.org/html/2605.07409#bib.bib29)). Methodological critiques in fairness and social NLP emphasize that quantities like “toxicity” and “bias” are often operationalized as convenient proxies that suffer from construct slippage and normative mismatch, including dialect and genre confounds (Blodgett et al., [2020](https://arxiv.org/html/2605.07409#bib.bib10); Sap et al., [2019](https://arxiv.org/html/2605.07409#bib.bib60)). This perspective complements our claim that measurement failures are frequently method-variance failures: what appears to be a social construct may largely reflect dataset composition, genre, dialect, or frequency artifacts.
Our CVP makes these critiques actionable by requiring explicit confound tests, invariance checks aligned with established NLP evaluation practice (Ribeiro et al., [2020](https://arxiv.org/html/2605.07409#bib.bib59); Gardner et al., [2020](https://arxiv.org/html/2605.07409#bib.bib27)), and transparent reporting of what the proxy is sensitive to.

### 2.3 Causal Representation Learning and Non-Identification.

We ground our framework in Causal Representation Learning (Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)), which establishes a fundamental non-identification result: without explicit structural assumptions or interventions, it is theoretically impossible to disentangle latent causal factors from observational data alone. We apply this constraint to social measurement: extracting a valid construct ($C$) from text requires actively modeling and neutralizing nuisance factors ($Z$), rather than assuming that an unsupervised embedding will spontaneously isolate them.

Beyond representation learning, a complementary literature studies when text-derived variables can support downstream causal inference (Egami et al., [2022](https://arxiv.org/html/2605.07409#bib.bib20)) and how to use imperfect surrogates for downstream inference via design-based estimators (Egami et al., [2023](https://arxiv.org/html/2605.07409#bib.bib21)). Learned-proxy methodology in political science further emphasizes that post-hoc correlation is insufficient for testing causal theories with learned proxies (Knox et al., [2022](https://arxiv.org/html/2605.07409#bib.bib41)). We position CVP as addressing the measurement-validity prerequisite that these downstream frameworks typically assume.

## 3 The Impossibility of Unsupervised Inverse Construction

Current computational social science often relies on the assumption that a specific latent social construct $C$ (e.g., a hate sentiment score) can be recovered directly from observed documents $D$ using a fixed proxy function $f(D)$. We argue that this formulation constitutes an ill-posed inverse problem formally analogous to the identification problem in causal inference.

### 3.1 Case I: Single-Document Measurement

Let the generation of a document $D \in \mathcal{D}$ be modeled as a stochastic process dependent on two disjoint sets of latent factors: a scalar target construct $c \in \mathbb{R}$ and a vector of nuisance attributes $\mathbf{z} \in \mathbb{R}^{k}$ (e.g., Topic, Length, Authorship). We model the document generation process as a draw from a conditional probability distribution parameterized by $\theta$:

$$p_{\theta}(D \mid c, \mathbf{z}) \tag{1}$$

The standard measurement pipeline attempts to invert this process:

1. Representation Learning: An encoder $E: \mathcal{D} \to \mathbb{R}^{d}$ maps the text to an embedding $\mathbf{e} = E(D) \in \mathbb{R}^{d}$.
2. Proxy Construction: A function $f: \mathbb{R}^{d} \to \mathbb{R}$ is applied to estimate the scalar construct: $\hat{c} = f(\mathbf{e})$.

Ideally, $f(E(D))$ should recover $c$. However, as established in the representation learning literature (Locatello et al., [2019](https://arxiv.org/html/2605.07409#bib.bib48); Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)), this recovery is theoretically impossible without structural assumptions.

#### Proposition 1 \(Rotational Ambiguity\)\.

Let the joint latent space be $\mathbf{h} = [c; \mathbf{z}]$ with a factorized isotropic Gaussian prior $p(\mathbf{h}) = \mathcal{N}(0, I)$. For any unsupervised objective that maximizes the likelihood of the data $p(D)$, the learned representation is identifiable only up to an arbitrary orthogonal rotation. Consequently, the scalar dimension $c$ is inextricably mixed with the nuisance vector $\mathbf{z}$.

#### Proof Sketch\.

Following Locatello et al. ([2019](https://arxiv.org/html/2605.07409#bib.bib48)), consider the marginal likelihood $p(D) = \int p_{\theta}(D \mid \mathbf{h})\, p(\mathbf{h})\, d\mathbf{h}$. Let $R$ be an arbitrary orthogonal matrix ($R^{\top} R = I$). We define a transformed latent space $\mathbf{h}' = R\mathbf{h}$. Due to the isotropy of the Gaussian prior, the density is rotation-invariant: $p(\mathbf{h}') = p(\mathbf{h})$.

Now, consider an alternative generator defined by $G'(\mathbf{x}) = G(R^{\top}\mathbf{x})$. If we assume the data generation process uses latent variables $\mathbf{h}'$ and generator $G'$, the observation is $D = G'(\mathbf{h}') = G(R^{\top}(R\mathbf{h})) = G(\mathbf{h})$. Thus, the model $(G', \mathbf{h}')$ yields the exact same observational distribution $p(D)$ as the original model $(G, \mathbf{h})$.

Since the unsupervised objective depends only on $p(D)$, it cannot distinguish between the entangled basis $\mathbf{h}'$ (where the first dimension is a linear combination of $c$ and $\mathbf{z}$) and the disentangled basis $\mathbf{h}$. In other words, any first coordinate of the form $h'_{1} = \alpha c + \beta^{\top}\mathbf{z}$, where $(\alpha, \beta)$ is the first row of an orthogonal matrix $R$ (so $\alpha^{2} + \|\beta\|^{2} = 1$), is an equally admissible latent factor. Thus, $c$ is not identified.

#### Implication\.

This proves that statistical independence does not imply disentanglement (Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)). Even if $c \perp \mathbf{z}$ in the real world, the learned embedding $\mathbf{e}$ can arbitrarily rotate the basis such that the dimension used for measurement is a mixture of signal and noise. Thus, any proxy $\hat{c} = f(\mathbf{e})$ suffers from intrinsic method variance. We will illustrate this claim with an example in the two-document case.
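The rotation argument can be checked numerically. The following is a minimal sketch on synthetic Gaussian latents, not part of the paper's experiments; the sample size, the number of nuisance dimensions, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50_000, 3                        # k nuisance dimensions (illustrative choice)
h = rng.standard_normal((n, 1 + k))     # columns: [c, z_1, ..., z_k], prior N(0, I)

# Draw a random orthogonal matrix R via QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((1 + k, 1 + k)))
h_rot = h @ R.T                         # h' = R h, applied row-wise

# The rotated latents still look like N(0, I) ...
cov_err = np.abs(np.cov(h_rot, rowvar=False) - np.eye(1 + k)).max()

# ... but the first coordinate of h' is now alpha * c + beta^T z.
alpha, beta = R[0, 0], R[0, 1:]
corr_with_c = np.corrcoef(h_rot[:, 0], h[:, 0])[0, 1]

print(f"max |cov(h') - I| = {cov_err:.4f}")
print(f"alpha = {alpha:.3f}, corr(h'_1, c) = {corr_with_c:.3f}")
```

Both `h` and `h_rot` are consistent with the same prior, so a likelihood-based objective has no reason to prefer the basis in which the first coordinate equals $c$.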

### 3.2 Case II: Two-Document Measurement

Constructing variables from pairwise document embeddings introduces an additional layer of entanglement. Consider measuring the relationship between two documents, such as Scientific Novelty ($C$), defined as the distance between a new paper $D_{i}$ and prior work $D_{j}$. The key nuisance is topical overlap: both documents may share a dominant topic (e.g., “Deep Learning”), which can account for most surface-level lexical and semantic mass. Standard practice then applies a similarity or distance function (e.g., cosine similarity) to the *full* embeddings, implicitly allowing this nuisance component to dominate the measurement.

Even if we assume an oracle encoder $E$ that produces a perfectly disentangled concatenation $\mathbf{e} = [\mathbf{c} \oplus \mathbf{z}]$, where $\mathbf{c}$ encodes the concept of interest and $\mathbf{z}$ encodes nuisance variation, similarity-based measurement can still fail for two reasons:

1. (I) Target mismatch. The metric of interest should be applied to the concept embeddings $(\mathbf{c}_{i}, \mathbf{c}_{j})$, but is typically applied to the full vectors $(\mathbf{e}_{i}, \mathbf{e}_{j})$, allowing variance in $\mathbf{z}$ to dominate.
2. (II) Metric indeterminacy. Even if $\mathbf{c}$ were perfectly isolated, there is no theoretical guarantee that a particular geometric function (e.g., cosine similarity versus Euclidean distance) is the correct proxy for an abstract construct. Determining the appropriate functional form requires empirical validation (Section 4).

We now illustrate issue (I) by expanding two common choices of metrics applied to the full embedding $\mathbf{e}$.

Cosine similarity \(normalization effect\)\.

$$\cos(\mathbf{e}_{i}, \mathbf{e}_{j}) = \frac{\mathbf{c}_{i} \cdot \mathbf{c}_{j} + \mathbf{z}_{i} \cdot \mathbf{z}_{j}}{\sqrt{\|\mathbf{c}_{i}\|^{2} + \|\mathbf{z}_{i}\|^{2}}\;\sqrt{\|\mathbf{c}_{j}\|^{2} + \|\mathbf{z}_{j}\|^{2}}} \tag{2}$$

When $\|\mathbf{z}\| \gg \|\mathbf{c}\|$, both the numerator and the denominator are dominated by the nuisance component, so high cosine similarity largely reflects “same topic” rather than the intended construct (e.g., “novel conceptual contribution”).
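A small numerical illustration of Eq. (2), assuming an idealized block embedding $\mathbf{e} = [\mathbf{c} \oplus \mathbf{z}]$ with hand-picked norms. The vectors are synthetic and the 10:1 norm ratio is an arbitrary assumption chosen to make the effect visible.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
c_i = unit(rng.standard_normal(8))      # concept block, norm 1
c_j = -c_i                              # opposite concept content
z_shared = 10.0 * unit(rng.standard_normal(8))  # same topic, norm 10

e_i = np.concatenate([c_i, z_shared])
e_j = np.concatenate([c_j, z_shared])

cos_concept = cosine(c_i, c_j)          # concept blocks alone
cos_full = cosine(e_i, e_j)             # full embeddings, as in Eq. (2)
print(f"concept-only cosine: {cos_concept:.3f}")
print(f"full-embedding cosine: {cos_full:.3f}")
```

Here the concept blocks point in opposite directions, yet the full-embedding cosine equals $(-1 + 100)/(1 + 100) = 99/101$, because the shared nuisance block carries almost all of the norm.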

Euclidean distance \(additive decomposition issue\)\.

$$\|\mathbf{e}_{i} - \mathbf{e}_{j}\|^{2} = \|\mathbf{c}_{i} - \mathbf{c}_{j}\|^{2} + \|\mathbf{z}_{i} - \mathbf{z}_{j}\|^{2} \tag{3}$$

Because the contributions of $\Delta\mathbf{c}$ and $\Delta\mathbf{z}$ are purely additive, a large distance can arise from genuine conceptual deviation ($\Delta\mathbf{c}$) or from nuisance shifts such as subfield jargon or topical drift ($\Delta\mathbf{z}$), and the metric alone does not identify which source drives the score.
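The additive decomposition in Eq. (3) means that two very different situations can produce the same score. A minimal sketch with hand-built vectors (all values are illustrative assumptions, again using the idealized block structure):

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return float(np.sum((a - b) ** 2))

# Reference document: concept block (2 dims) and nuisance block (3 dims).
e_ref = np.concatenate([np.zeros(2), np.zeros(3)])

# Scenario A: conceptual shift only (delta c != 0, delta z = 0).
e_a = np.concatenate([np.array([3.0, 4.0]), np.zeros(3)])

# Scenario B: nuisance shift only (delta c = 0, delta z != 0).
e_b = np.concatenate([np.zeros(2), np.array([0.0, 3.0, 4.0])])

dist_a = sq_dist(e_ref, e_a)
dist_b = sq_dist(e_ref, e_b)
print(dist_a, dist_b)  # identical scores from different sources
```

Both scenarios yield a squared distance of 25, so the scalar score cannot tell a conceptual break from topical drift.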

### 3.3 Strategies for Alleviation

Because perfect disentanglement from observational text is not guaranteed without additional structure or assumptions, we outline a practical, multi-level strategy that intervenes at three points in the pipeline: (i) the input text, (ii) the learned representation, and (iii) the scoring function. The goal is to approximate measurement validity by reducing dependence on nuisance factors $Z$ while preserving information relevant to the construct $C$.

#### Level 1: Input disentanglement (preprocessing $D$).

The most direct intervention is to reduce the dependence of the observed document $D$ on nuisance factors before representation learning. Concretely, we transform $D \rightarrow D'$ to preserve construct-relevant content while attenuating nuisance variation.

*Targeted information extraction (LLM-based):* rather than embedding the full document, one can use an LLM as a constrained extractor to retain only spans that are logically relevant to the construct definition. For example, when measuring mission hybridity, one can extract mission statements and strategic objectives while discarding biographical details, formatting artifacts, or boilerplate that primarily contribute to $Z$.

*Stylistic normalization:* because embeddings are sensitive to length, register, and complexity, inputs can be normalized to a canonical form (e.g., comparable length, standardized structure, or neutral tone) to reduce stylistic nuisance components (e.g., $Z_{\text{style}}$).

*Entity masking (anonymization):* when the construct is conceptual (e.g., policy stance) but nuisance variation includes entity identity (e.g., organization names, demographic markers), embedding models may rely on entity co-occurrence rather than the intended abstract relation. Replacing named entities with placeholders (e.g., \[ORG\], \[PERSON\]) can reduce entity-driven shortcuts and encourage reliance on structural and contextual cues.

#### Level 2: Representation disentanglement (learning in the $\mathbf{e}$ space).

If the representation itself is entangled, we can impose objectives that explicitly suppress $Z$ while preserving $C$.

*Supervised/adversarial removal:* when labels for nuisance factors are available (e.g., topic IDs, author attributes, domain indicators), adversarial learning can discourage encodings of $Z$ in $\mathbf{e}$ (e.g., Adversarial Removal; Elazar and Goldberg, [2018](https://arxiv.org/html/2605.07409#bib.bib22)), and projection-based methods can iteratively identify and remove subspaces predictive of $Z$ (e.g., Iterative Nullspace Projection; Ravfogel et al., [2020](https://arxiv.org/html/2605.07409#bib.bib57)).

*Contrastive learning:* when one can construct paired examples that preserve $C$ while varying $Z$, contrastive objectives can encourage embeddings to align with construct-invariant features rather than nuisance variation.
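To make the projection-based idea concrete, here is a simplified sketch in the spirit of iterative nullspace projection. It is not the published INLP procedure: it assumes a continuous synthetic nuisance label and uses a least-squares direction in place of a trained classifier, and the helper names `linear_r2` and `remove_direction` are hypothetical.

```python
import numpy as np

def linear_r2(E, z):
    """Fit z from E by least squares; return the direction w and in-sample R^2."""
    Ec, zc = E - E.mean(axis=0), z - z.mean()
    w, *_ = np.linalg.lstsq(Ec, zc, rcond=1e-10)
    resid = zc - Ec @ w
    return w, 1.0 - float(resid @ resid) / float(zc @ zc)

def remove_direction(E, w):
    """Project every row of E onto the subspace orthogonal to w."""
    u = w / np.linalg.norm(w)
    return E - np.outer(E @ u, u)

rng = np.random.default_rng(3)
n, d = 2000, 16
z = rng.standard_normal(n)              # synthetic nuisance label (assumption)
E = rng.standard_normal((n, d))
E[:, 0] += 3.0 * z                      # nuisance leaks into the embedding by construction

E_clean, r2_history = E.copy(), []
for _ in range(4):                      # a few rounds; the count is arbitrary
    w, r2 = linear_r2(E_clean, z)
    r2_history.append(r2)
    E_clean = remove_direction(E_clean, w)

print([round(v, 3) for v in r2_history])
```

Each round removes the direction that currently best predicts the nuisance label, so linear predictability of $Z$ can only go down. Whether the construct signal survives the projection is a separate question that the validity checks in Section 4 are meant to address.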

#### Level 3: Methodological disentanglement (designing the scoring function $f$).

When residual entanglement remains, the measurement function itself can be designed to cancel nuisance contributions. *Counterfactual neutralization:* instead of reporting a static score $f(\mathbf{e}_{\text{obs}})$, define measurement as a differential relative to a baseline embedding:

$$\hat{C} = f(\mathbf{e}_{\text{obs}}) - f(\mathbf{e}_{\text{base}}), \tag{4}$$

where $\mathbf{e}_{\text{base}}$ is the embedding of a counterfactual “neutralized” version of the same text in which the construct signal is minimized (e.g., rewriting to remove stance/novelty claims while preserving topical content). The subtraction aims to remove the baseline contribution attributable to nuisance factors $Z$, leaving a score more sensitive to the construct-relevant component.
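The differential in Eq. (4) can be sketched on synthetic vectors. In practice $\mathbf{e}_{\text{base}}$ would come from embedding an LLM-neutralized rewrite; here hand-built vectors stand in for both embeddings, and the scoring function (squared distance to a reference) and the block structure $[\mathbf{c} \oplus \mathbf{z}]$ are assumptions made so that the cancellation is exact.

```python
import numpy as np

def sq_dist_to(ref):
    """Return a scoring function f(e) = ||e - ref||^2 (illustrative choice of f)."""
    return lambda e: float(np.sum((e - ref) ** 2))

def neutralized_score(e_obs, e_base, f):
    """Eq. (4): C_hat = f(e_obs) - f(e_base)."""
    return f(e_obs) - f(e_base)

rng = np.random.default_rng(4)
c_ref, z_ref = rng.standard_normal(3), rng.standard_normal(5)

c_obs = c_ref + np.array([2.0, 0.0, 0.0])      # observed text carries construct signal
c_base = c_ref.copy()                          # neutralized rewrite: signal removed
z_doc = z_ref + 5.0 * rng.standard_normal(5)   # topic/style shared by both versions

f = sq_dist_to(np.concatenate([c_ref, z_ref]))
e_obs = np.concatenate([c_obs, z_doc])
e_base = np.concatenate([c_base, z_doc])

raw = f(e_obs)                                 # static score, inflated by nuisance shift
score = neutralized_score(e_obs, e_base, f)    # nuisance term cancels
print(f"raw = {raw:.2f}, neutralized = {score:.2f}")
```

With an additive $f$ and a rewrite that leaves $\mathbf{z}$ untouched, the nuisance term cancels exactly and the neutralized score equals $\|\Delta\mathbf{c}\|^{2} = 4$. With real encoders and non-additive scoring functions such as cosine similarity, the cancellation is only approximate and should be checked empirically.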

## 4 The Construct Validity Protocol (CVP)

Construct validity is not merely a quality check; it is a prerequisite for identification. In both social science and machine learning, isolating a target factor from nuisance variation requires additional structure (e.g., design constraints or validation evidence), much like causal representation learning requires structural assumptions to separate latent causes from noise (Locatello et al., [2019](https://arxiv.org/html/2605.07409#bib.bib48); Schölkopf et al., [2021](https://arxiv.org/html/2605.07409#bib.bib61)). For embedding-based measurement, the core requirement is that variation in the proxy $\widehat{C}$ is attributable to the intended construct $C$, rather than superficial regularities such as topic, style, or length ($Z$) (Grimmer and Stewart, [2013](https://arxiv.org/html/2605.07409#bib.bib31); Adcock and Collier, [2001](https://arxiv.org/html/2605.07409#bib.bib1)). Without this form of *measurement identification*, downstream inferences are fragile: we cannot interpret effects of a variable that has not been credibly isolated. We therefore propose the Construct Validity Protocol (CVP) as a standard operating procedure that moves from an abstract concept to a testable measurement instrument.

### 4.1 Overview

CVP has three phases. Phase 1 establishes a construct specification and qualitative alignment. Phase 2 emphasizes instrument design choices that reduce entanglement with nuisance factors. Phase 3 reports a structured validity suite, a Validity Card, that documents stability and quantitative evidence (convergent, discriminant/incremental, and criterion-related checks).

### 4.2 Phase 1: Conceptualization

Before any modeling, the researcher should establish face and content validity through expert scrutiny (Adcock and Collier, [2001](https://arxiv.org/html/2605.07409#bib.bib1); DeVellis, [2016](https://arxiv.org/html/2605.07409#bib.bib19); Lawshe, [1975](https://arxiv.org/html/2605.07409#bib.bib44); Haynes et al., [1995](https://arxiv.org/html/2605.07409#bib.bib35)). This phase clarifies what the construct is *and is not*, and what nuisance dimensions must be neutralized.

#### Deliverables\.

*Construct map:* Provide a concise definition of $C$ and enumerate the primary nuisance dimensions $Z$ (e.g., domain/topic vocabulary, style/register, length, venue/time).

*Facet blueprint:* Specify a short list of facets of $C$ (inclusion/exclusion criteria) and the types of textual evidence expected for each facet.

*Exemplar set:* Curate a three-tier anchor set of known high-$C$, low-$C$, and borderline/mid-spectrum documents (“known-groups” anchors) that are broadly agreed upon within the research community; borderline anchors are used to probe facet boundaries and decision rules.

#### Face\-validity audit\.

Run the full measurement pipeline on the exemplar set and inspect the highest\- and lowest\-scoring cases, then examine borderline anchors to probe decision boundaries\. Independent domain experts should judge whether the ranking is plausible and, crucially, which cues appear to drive the scores\. Systematic disagreements \(including non\-ideal false positives and false negatives on the anchors\) indicate construct–measure misalignment and should trigger revisions before quantitative validation\.

### 4.3 Phase 2: Operationalization

Rather than defaulting to black-box similarity scores, Phase 2 focuses on designing instruments that reduce entanglement with $Z$ and better isolate $C$. Section 3 outlines concrete intervention points (input preprocessing, representation learning, and scoring-function design), which can be instantiated as task-specific measurement instruments.

### 4.4 Phase 3: The Validity Suite

Validity is not a single statistic but an argument supported by multiple, conceptually distinct sources of evidence (Cronbach and Meehl, [1955](https://arxiv.org/html/2605.07409#bib.bib15); Campbell and Fiske, [1959](https://arxiv.org/html/2605.07409#bib.bib13); Messick, [1995](https://arxiv.org/html/2605.07409#bib.bib52); American Educational Research Association et al., [2014](https://arxiv.org/html/2605.07409#bib.bib2)). This is especially important for embedding-based measures, where instability and nuisance leakage are common (Grimmer and Stewart, [2013](https://arxiv.org/html/2605.07409#bib.bib31); Gentzkow et al., [2019](https://arxiv.org/html/2605.07409#bib.bib30)). We therefore recommend reporting a structured Validity Card that documents (at minimum) stability and discriminant/incremental evidence, and (when feasible) convergent and criterion-related evidence.¹

¹ Measurement theory distinguishes multiple families of evidence, including content/face evidence, reliability/stability, convergent and discriminant evidence (external structure), and criterion-related evidence (Campbell and Fiske, [1959](https://arxiv.org/html/2605.07409#bib.bib13); Messick, [1995](https://arxiv.org/html/2605.07409#bib.bib52); American Educational Research Association et al., [2014](https://arxiv.org/html/2605.07409#bib.bib2)). We prioritize *(i) stability* as a prerequisite and *(ii) discriminant + incremental validity* as the most diagnostic for ruling out topic/style surrogacy in embedding pipelines; known-groups and criterion-related checks are recommended complements.

#### Validity Card 1: Reliability / Stability \(prerequisite\)\.

Even deterministic encoders can yield unstable measures due to preprocessing choices, chunking, or prompt/instrument sensitivity. If small, defensible perturbations materially change the resulting variable, downstream validity claims become specification-dependent (Koo and Li, [2016](https://arxiv.org/html/2605.07409#bib.bib42); Nunnally and Bernstein, [1994](https://arxiv.org/html/2605.07409#bib.bib53)).

Recommended perturbations. Swap between close encoder variants; paraphrase prompts; vary neutralization prompts; vary embedding aggregation rules.

What to report. (i) A clearly enumerated perturbation set and the number of variants per document; (ii) *absolute-agreement* reliability via ICC (report the ICC form, e.g., ICC(2,1) or ICC(2,$k$)) (Shrout and Fleiss, [1979](https://arxiv.org/html/2605.07409#bib.bib62); Koo and Li, [2016](https://arxiv.org/html/2605.07409#bib.bib42)). As a rule of thumb, ICC $\geq 0.75$ is often interpreted as “good” and ICC $\geq 0.90$ as “excellent” (Koo and Li, [2016](https://arxiv.org/html/2605.07409#bib.bib42)).

Footnote 2: Pearson correlation can be misleading for stability because it is invariant to shifts and rescalings. ICC for absolute agreement penalizes systematic shifts in level across perturbations.
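A minimal sketch of the absolute-agreement ICC computation, assuming the perturbation outputs are arranged as a documents-by-variants matrix. The function name `icc_absolute_agreement` and the toy data are illustrative assumptions, not part of the protocol; the formulas are the standard two-way random-effects forms of Shrout and Fleiss (1979).

```python
import numpy as np

def icc_absolute_agreement(scores):
    """Two-way random-effects, absolute-agreement ICCs.

    scores: array of shape (n_documents, k_variants); each column holds the
    measure produced under one perturbation (treated as a "rater").
    Returns (ICC(2,1), ICC(2,k)).
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for documents (rows), variants (columns), and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Toy illustration (hypothetical data): variant B is variant A plus a constant.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shifted = np.column_stack([a, a + 2.0])
pearson_r = np.corrcoef(shifted[:, 0], shifted[:, 1])[0, 1]
icc_single, icc_average = icc_absolute_agreement(shifted)
print(f"Pearson r = {pearson_r:.3f}, ICC(2,1) = {icc_single:.3f}")
```

In this toy case the second variant is a pure level shift of the first, so Pearson correlation is 1 while ICC(2,1) is 5/9. That gap is why absolute agreement is the more conservative stability statistic.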

#### Validity Card 2: Convergent validity \(agreement with an independent measure\)\.

Does $\widehat{C}_i$ correlate with an independent measure of the same construct (e.g., expert ratings using a rubric that operationalizes $C$)? This follows multitrait–multimethod logic: measures of the same trait should agree more than measures of different traits (Campbell and Fiske, [1959](https://arxiv.org/html/2605.07409#bib.bib13)).

$$C^{gold}_i = \alpha + \beta_{conv}\,\widehat{C}_i + \epsilon_i \tag{5}$$

What to report. (i) Reliability of $C^{gold}$ (e.g., ICC or Krippendorff’s $\alpha$), since convergent evidence is bounded by gold reliability (Spearman, [1904](https://arxiv.org/html/2605.07409#bib.bib65); Koo and Li, [2016](https://arxiv.org/html/2605.07409#bib.bib42)); (ii) effect size (standardized $\beta_{conv}$ or correlation) with confidence interval and a diagnostic plot; and (iii) if possible, out-of-sample convergence via a held-out gold set or cross-validation (Hastie et al., [2009](https://arxiv.org/html/2605.07409#bib.bib34)). Effects around 0.10 are often described as small, 0.30 moderate, and 0.50+ large (Cohen, [1988](https://arxiv.org/html/2605.07409#bib.bib14)).
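The convergent check can be sketched as follows, assuming a single predictor so that the standardized slope in Eq. (5) equals the Pearson correlation. The helper names, the percentile bootstrap, and the simulated data are illustrative assumptions; `attenuation_ceiling` applies the classical Spearman attenuation bound to show how gold reliability caps the observable correlation.

```python
import numpy as np

def convergent_effect(c_hat, c_gold, n_boot=2000, seed=0):
    """Pearson r between proxy and gold scores with a percentile bootstrap 95% CI."""
    c_hat = np.asarray(c_hat, dtype=float)
    c_gold = np.asarray(c_gold, dtype=float)
    r = np.corrcoef(c_hat, c_gold)[0, 1]
    rng = np.random.default_rng(seed)
    n = len(c_hat)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample documents with replacement
        boots[b] = np.corrcoef(c_hat[idx], c_gold[idx])[0, 1]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return r, (lo, hi)

def attenuation_ceiling(gold_reliability, proxy_reliability=1.0):
    """Largest correlation observable given the reliabilities of both measures."""
    return float(np.sqrt(gold_reliability * proxy_reliability))

# Simulated example (hypothetical data, not results from the paper).
rng = np.random.default_rng(1)
gold = rng.normal(size=300)
proxy = 0.6 * gold + 0.8 * rng.normal(size=300)
r, (ci_lo, ci_hi) = convergent_effect(proxy, gold)
print(f"r = {r:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
print(f"ceiling at gold reliability 0.64: {attenuation_ceiling(0.64):.2f}")
```

Reading the observed correlation against the ceiling, rather than against 1.0, keeps a noisy gold standard from being mistaken for a weak proxy.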

#### Validity Card 3: Discriminant \+ incremental validity \(primary diagnostic\)\.

A central threat is that $\widehat{C}$ becomes a proxy for topic, style, venue, or time rather than the intended construct (Grimmer and Stewart, [2013](https://arxiv.org/html/2605.07409#bib.bib31); Gentzkow et al., [2019](https://arxiv.org/html/2605.07409#bib.bib30)). Discriminant validity asks whether $\widehat{C}$ is distinct from nuisance dimensions. Incremental validity asks whether $\widehat{C}$ adds explanatory power beyond nuisances for an *external validation label* $L_i$ that directly operationalizes $C$ (often human-coded).

Step 1 (discriminant). Test whether $Z$ mechanically explains $\widehat{C}$:

$$\widehat{C}_i = \alpha + \sum_k \gamma_k Z_{k,i} + \epsilon_i. \tag{6}$$

Step 2 (incremental). Test whether $\widehat{C}$ adds signal for an external validation label $L_i$ beyond $Z$:

$$L_i = \alpha + \sum_k \delta_k Z_{k,i} + u_i, \tag{7}$$

$$L_i = \alpha + \beta_{inc}\,\widehat{C}_i + \sum_k \delta_k Z_{k,i} + u_i. \tag{8}$$

What to report. Pre-specify nuisance blocks (topic proxies, length/style measures, venue/field/year fixed effects) and report (i) predictability of $\widehat{C}$ from $Z$ (e.g., cross-validated $R^2$) with block-wise contributions; and (ii) incremental evidence via standardized $\beta_{inc}$ with confidence interval and out-of-sample $\Delta R^2$ from adding $\widehat{C}$ to the nuisance-only model (Hastie et al., [2009](https://arxiv.org/html/2605.07409#bib.bib34); Hunsley and Meyer, [2003](https://arxiv.org/html/2605.07409#bib.bib38)). High predictability of $\widehat{C}$ from topic/FE blocks is a warning sign of surrogacy; instability of $\beta_{inc}$ across reasonable nuisance sets is a red flag (Grimmer and Stewart, [2013](https://arxiv.org/html/2605.07409#bib.bib31); Cohen, [1988](https://arxiv.org/html/2605.07409#bib.bib14)).
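The two steps can be sketched with plain least squares and k-fold cross-validation. This is an illustrative sketch under stated assumptions: the helper `cv_r2`, the pooled out-of-fold $R^2$ definition, the linear models, and the simulated data are all choices made here for demonstration, not the authors' pipeline.

```python
import numpy as np

def cv_r2(X, y, n_folds=5, seed=0):
    """Pooled out-of-fold R^2 of an OLS fit of y on X (with intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    preds = np.empty(n)
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        preds[test] = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

# Simulated example (hypothetical data): proxy and label share construct C,
# and each is also contaminated by a different nuisance column of Z.
rng = np.random.default_rng(0)
n = 1000
C = rng.normal(size=n)
Z = rng.normal(size=(n, 3))
c_hat = C + 0.5 * Z[:, 0] + 0.5 * rng.normal(size=n)
label = C + 0.5 * Z[:, 1] + 0.5 * rng.normal(size=n)

r2_discriminant = cv_r2(Z, c_hat)                        # Eq. (6): Z -> C_hat
r2_nuisance = cv_r2(Z, label)                            # Eq. (7): Z -> L
r2_full = cv_r2(np.column_stack([Z, c_hat]), label)      # Eq. (8): Z + C_hat -> L
delta_r2 = r2_full - r2_nuisance
print(f"R2(Z -> C_hat) = {r2_discriminant:.3f}, delta R2 = {delta_r2:.3f}")
```

Using the same seed for both label models keeps the folds identical, so the $\Delta R^2$ compares like with like. Block-wise contributions can be obtained by calling `cv_r2` on each nuisance block separately.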

#### Validity Card 4: Known\-groups validity \(optional\)\.

Using the Phase 1 exemplar set, test whether the measure separates pre-specified high-$C$ versus low-$C$ groups:

$$\widehat{C}_i = \alpha + \tau\,\mathbb{1}\{i \in \text{High-}C\} + \epsilon_i. \tag{9}$$

What to report. Separation effect (e.g., Cohen’s $d$ with confidence interval) and a distribution plot (e.g., ECDF). As a rule of thumb, $d \approx 0.2$ is small, $0.5$ medium, and $0.8$ large (Cohen, [1988](https://arxiv.org/html/2605.07409#bib.bib14)).
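A short sketch of the known-groups effect size, assuming the pooled-standard-deviation form of Cohen's $d$ and a within-group percentile bootstrap for the interval. Function names and simulated scores are hypothetical; note that $\tau$ in Eq. (9) is the raw mean difference, and $d$ is that difference divided by the pooled standard deviation.

```python
import numpy as np

def cohens_d(high, low):
    """Cohen's d with pooled standard deviation (high minus low)."""
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * high.var(ddof=1) + (n2 - 1) * low.var(ddof=1)) / (n1 + n2 - 2)
    return (high.mean() - low.mean()) / np.sqrt(pooled_var)

def cohens_d_ci(high, low, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for d, resampling within each group."""
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    rng = np.random.default_rng(seed)
    boots = [
        cohens_d(rng.choice(high, size=len(high)), rng.choice(low, size=len(low)))
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return lo, hi

# Simulated proxy scores for the two exemplar groups (hypothetical data).
rng = np.random.default_rng(2)
high_scores = rng.normal(loc=0.8, scale=1.0, size=200)
low_scores = rng.normal(loc=0.0, scale=1.0, size=200)
d = cohens_d(high_scores, low_scores)
d_lo, d_hi = cohens_d_ci(high_scores, low_scores)
print(f"d = {d:.2f}, 95% CI [{d_lo:.2f}, {d_hi:.2f}]")
```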

#### Validity Card 5: Criterion\-related / predictive evidence \(complement\)\.

If $C$ is substantively meaningful, $\widehat{C}$ should relate to downstream outcomes in the theorized direction. For validity testing, prioritize interpretability and attribution: a linear specification provides a conservative baseline that tests whether $\widehat{C}$ has independent signal beyond strong nuisance controls.

$$Y_{t+k,i} = \alpha + \beta_{pred}\,\widehat{C}_i + \mathbf{\Gamma}\,\text{Controls}_i + \epsilon_i. \tag{10}$$

What to report. Standardized $\beta_{pred}$ with confidence interval and the incremental explanatory power from adding $\widehat{C}$ beyond strong nuisance controls. When possible, include a negative-control outcome $Y_{\text{placebo}}$ that should not be predicted by $C$; strong prediction of implausible outcomes suggests dataset artifacts (Lipsitch et al., [2010](https://arxiv.org/html/2605.07409#bib.bib46)).
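One way to run the predictive test alongside a negative control is sketched below. It assumes ordinary least squares on z-scored variables with a normal-approximation interval; the function `standardized_beta` and the simulated outcome, placebo, and controls are hypothetical stand-ins for a real study's variables.

```python
import numpy as np

def zscore(v):
    """Column-wise z-scoring with the sample standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)

def standardized_beta(y, c_hat, controls):
    """Standardized coefficient on c_hat (Eq. 10) with an approximate 95% CI."""
    y = zscore(y)
    X = np.column_stack([np.ones(len(y)), zscore(c_hat), zscore(controls)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], (coef[1] - 1.96 * se, coef[1] + 1.96 * se)

# Simulated example (hypothetical data): the outcome depends on the proxy,
# while the placebo depends only on a control that also drives the proxy.
rng = np.random.default_rng(3)
n = 2000
controls = rng.normal(size=(n, 2))
c_hat = 0.5 * controls[:, 0] + rng.normal(size=n)
outcome = 0.4 * c_hat + 0.5 * controls[:, 1] + rng.normal(size=n)
placebo = 0.5 * controls[:, 0] + rng.normal(size=n)

beta_pred, ci_pred = standardized_beta(outcome, c_hat, controls)
beta_placebo, ci_placebo = standardized_beta(placebo, c_hat, controls)
print(f"beta_pred = {beta_pred:.3f}, beta_placebo = {beta_placebo:.3f}")
```

The placebo is marginally correlated with the proxy through the shared control, so a near-zero placebo coefficient here indicates that the controls are doing their job; a clearly nonzero one would point to residual confounding.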

### 4\.5Worked Example

To demonstrate that CVP is operational, we executed it on GoEmotions (Demszky et al., [2020](https://arxiv.org/html/2605.07409#bib.bib18)) (43,410 train / 5,426 dev / 5,427 test). GoEmotions is vector-valued; for brevity we instantiate one coordinate (*gratitude*) with $L_i = 1$ iff the human label set contains gratitude. We focus on Cards 1 and 3 since they directly test stability and nuisance entanglement; the remaining cards typically require additional independent instruments or external outcomes beyond GoEmotions.

#### Card 1\.

We instantiate the recommended perturbations by swapping close encoders (`all-MiniLM-L6-v2` vs `all-MiniLM-L12-v2`), varying aggregation (mean vs CLS pooling), and applying a simple style-normalization (original vs lowercased, punctuation-stripped text), yielding $k=8$ variants (2 encoders $\times$ 2 pooling rules $\times$ 2 normalization conditions) on $n=2000$ test examples. Each variant yields a reasonable proxy (AUC range 0.9407–0.9662). Treating variants as “raters,” absolute-agreement reliability is ICC(2,1) = 0.8467 and ICC(2,$k$) = 0.9779 (Shrout and Fleiss, [1979](https://arxiv.org/html/2605.07409#bib.bib62); Koo and Li, [2016](https://arxiv.org/html/2605.07409#bib.bib42)).
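As a consistency check on the two reported coefficients, the average-measure form can be recovered from the single-measure form through the Spearman-Brown relation, which holds exactly for the two-way Shrout-Fleiss ICCs. This check is added here for illustration and is not part of the original analysis:

$$\mathrm{ICC}(2,k) = \frac{k\,\mathrm{ICC}(2,1)}{1 + (k-1)\,\mathrm{ICC}(2,1)} = \frac{8 \times 0.8467}{1 + 7 \times 0.8467} = \frac{6.7736}{6.9269} \approx 0.9779.$$

The single-variant value (0.8467) describes how reproducible one arbitrary pipeline configuration is, while the $k$-variant value describes the reliability of the average over all eight.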

#### Card 3\.

Let $\widehat{C}_i = p(L_i = 1 \mid x_i)$ be a concrete proxy from sentence embeddings (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.07409#bib.bib58)). We form nuisance blocks $Z$ from (i) length/style features (token/character length, “!”/“?” counts, uppercase ratio) and (ii) a topic block (TF–IDF + SVD). Discriminant regression (Eq. (6)) shows $\widehat{C}$ is largely nuisance-recoverable ($R^2 = 0.0245$ from length/style only; $0.7762$ from topic only; $0.7768$ from full $Z$). Incremental prediction (Eqs. (7)–(8)) improves from AUC 0.9658 ($Z \to L$) to 0.9831 ($Z + \widehat{C} \to L$), with $\beta_{\mathrm{inc}} > 0$.
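The length/style block can be sketched with the standard library alone. The exact feature definitions below are assumptions made for illustration (whitespace tokenization, and uppercase ratio taken over alphabetic characters); the source lists the features but does not specify how they were computed.

```python
def length_style_features(text):
    """Length/style nuisance features for one document (illustrative definitions)."""
    tokens = text.split()  # assumption: whitespace tokenization
    letters = [ch for ch in text if ch.isalpha()]
    n_upper = sum(ch.isupper() for ch in letters)
    return {
        "n_tokens": len(tokens),
        "n_chars": len(text),
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        # assumption: ratio over alphabetic characters, 0.0 if there are none
        "uppercase_ratio": n_upper / len(letters) if letters else 0.0,
    }

features_high = length_style_features("Thanks!!")
features_low = length_style_features("Why? I love it.")
print(features_high)
print(features_low)
```

On the reported numbers, the topic block accounts for almost all of the nuisance-recoverable variance (0.7762 of 0.7768), and adding $\widehat{C}$ to the nuisance-only model raises AUC by 0.9831 − 0.9658 = 0.0173.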

#### Anchors and errors \(high/low/borderline\)\.

High-$C$: “Thanks!!”, “Thank you!”, “Thanks”. Low-$C$: “No I’m not”, “No no she was \[NAME\]”, “Why? I love it.” Borderline: “Thank \[NAME\] for the person walking down the sidewalk…”, “Did you kill the shark…Thanks \[NAME\]!”. A salient FP is “Yeah thank you you ungrateful bast…”, and FNs include “That’s because \[NAME\] thinks he’s the man…” and “Daily would be great, but…work…”.

## 5Forensic Analysis: The Jangle Fallacy in Practice

To assess how often social NLP work substantiates the leap from an embedding-based proxy to a theoretically valid measure, we conducted a forensic coding of 17 influential papers (2020–2025) in ACL/NAACL/EMNLP that propose computational metrics for abstract social constructs. Our goal is not to single out individual contributions, but to evaluate whether the literature, in aggregate, separates (i) an *empirical instrument* (a scoring function over text representations) from (ii) the *theoretical construct* it claims to measure. Following standard measurement theory, we coded each paper on seven dimensions of validity and identification: construct definition, face/content evidence, reliability/stability, convergent validity, discriminant validity, predictive/criterion validity, and handling of confounding covariates (see Appendix A for coding criteria).

#### Results\.

Table [1](https://arxiv.org/html/2605.07409#S5.T1) summarizes the distribution of evidence. Two patterns stand out. First, many papers provide *conceptual framing* and basic plausibility checks: construct definition is frequently articulated (Construct Validity: 10 Yes, 7 Partial, 0 No), and most papers provide some face/content evidence (6 Yes, 11 Partial, 0 No), often in the form of annotation guidelines, curated examples, or qualitative sanity checks. Second, the evidence most diagnostic for *measurement identification* is comparatively scarce. Convergent validity is rarely established against an independent instrument at the same unit of analysis (1 Yes, 12 Partial, 4 No). Discriminant validity is never fully demonstrated in the strict sense of ruling out nuisance surrogacy (0 Yes, 11 Partial, 6 No). Predictive/criterion validity under an external-outcome criterion is uncommon (1 Yes, 3 Partial, 13 No). Finally, none of the sampled papers uses causal identification methods to isolate the construct from confounds; confounding is handled primarily through heuristics or associational controls (0 Yes, 14 Partial, 3 No). Reliability/stability is the most consistently reported quantitative check (11 Yes, 4 Partial, 2 No), reflecting the routine reporting of inter-annotator agreement and, less commonly, robustness to perturbations.

#### Interpretation\.

This evidence profile is consistent with the paper’s central mechanism. When a measure is implemented as a geometric function over an entangled representation, three threats dominate: (i) the score may track nuisance structure such as topic, style, length, or prompt phrasing; (ii) the mapping from geometry to construct is underdetermined without validation (metric indeterminacy); and (iii) observed associations can be driven by confounding rather than construct variation. The forensic table suggests that current practice more often documents that a proxy is *defined* and *stable enough to compute* than that it *isolates* the intended construct.

More concretely, “Partial” evidence in the sample frequently takes forms that are compatible with the proxy presumption. Convergent validity is often operationalized as triangulation with related proxies, alignment with expectations, or comparisons to adjacent tasks; these are useful checks, but weaker than agreement with an independent validated instrument. Discriminant validity is commonly addressed through design constraints (e.g., topic balancing, prompt standardization) or qualitative discussion, rather than explicit tests showing that the score is not explained by nuisance variables. Confounding is typically handled via filtering, matching, or including covariates in predictive models, which controls associationally but does not support identification without additional assumptions.

#### Implication\.

These gaps matter for cumulative science. If the literature rarely demonstrates discriminant validity and confound isolation, then two papers that both report a measure of “bias” or “ideology” can plausibly be tracking different mixtures of construct and nuisance variance. In that regime, results may appear to cumulate under a shared construct label while the underlying variables are not commensurate, a textbook setting for the *jangle fallacy* (Kelley, [1927](https://arxiv.org/html/2605.07409#bib.bib40)). The forensic results therefore support the paper’s broader claim: downstream utility and qualitative plausibility are often treated as sufficient, even though the evidence most necessary to distinguish construct variance from method variance remains limited. This motivates the Construct Validity Protocol in Section 4 as a minimum reporting standard for embedding-based social measures.

Table 1:Validity evidence across 17 social measurement papers \(2020–2025\)\. “Yes” indicates rigorous adherence to measurement standards; “Partial” indicates heuristic or indirect checks; “No” indicates that the dimension is not substantively evaluated\.

## 6Alternative Views

#### Objection 1: “Post\-hoc Correlation is Sufficient\.”

A common counter-argument is that if a proxy correlates with a human label, disentanglement is unnecessary. We disagree. A high correlation coefficient (e.g., $r=0.7$) can be driven entirely by a confounding variable $Z$. For example, a “Toxicity” classifier might correlate well with human labels simply because both humans and models flag AAVE dialect as toxic (Zhou et al., [2022](https://arxiv.org/html/2605.07409#bib.bib68)). Without counterfactual validity (proving the score changes only when $C$ changes, not just when $Z$ changes), the metric is a biased estimator of the construct.

More broadly, learned-proxy methodology shows that post-hoc correlation is insufficient for testing causal theories with learned proxies (Knox et al., [2022](https://arxiv.org/html/2605.07409#bib.bib41)).
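The objection can be made concrete with a small simulation. Everything in it is a hypothetical illustration: the proxy tracks only the confound $Z$, the human label is partly contaminated by the same $Z$, and the coefficients are arbitrary. Raw agreement between proxy and label is high even though the proxy carries no information about $C$.

```python
import numpy as np

def residualize(v, z):
    """Part of v not linearly explained by z (with intercept)."""
    A = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

rng = np.random.default_rng(0)
n = 5000
C = rng.normal(size=n)   # target construct (hypothetical)
Z = rng.normal(size=n)   # confound, e.g. a dialect marker (hypothetical)
proxy = Z + 0.5 * rng.normal(size=n)             # score tracks Z only
label = 0.3 * C + Z + 0.5 * rng.normal(size=n)   # label also contaminated by Z

r_raw = np.corrcoef(proxy, label)[0, 1]          # post-hoc agreement
r_construct = np.corrcoef(proxy, C)[0, 1]        # agreement with the construct
r_partial = np.corrcoef(residualize(proxy, Z), residualize(label, Z))[0, 1]
print(f"raw r = {r_raw:.2f}, r with C = {r_construct:.2f}, partial r | Z = {r_partial:.2f}")
```

Once $Z$ is partialled out, the proxy-label correlation collapses toward zero, which is the pattern the discriminant and incremental tests in Card 3 are designed to expose.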

#### Objection 2: “Large Scale Solves This\.”

Another view posits that sufficiently large models (e.g., GPT-4) implicitly understand the difference between concepts and nuisance variables. While LLMs are powerful generators, their embeddings remain entangled representations of the training distribution. As demonstrated by the oracle encoder fallacy, a perfect encoder preserves all information, including the noise. Scale improves the fidelity of the embedding, but it does not automatically perform the causal abstraction required to separate $C$ from $Z$. Validity requires active methodological intervention, not just passive scaling.

## 7Conclusion

The migration of NLP from engineering benchmarks to Computational Social Science demands a parallel maturation in how we *measure*. We have argued that much of today’s construct-oriented work implicitly relies on “measurement by renaming”: treating convenient geometric heuristics (e.g., cosine distance in embedding space) as if they were identified measures of latent social constructs. By making the measurement problem explicit in a simple data-generating view, $D = G(C, Z)$, we show why this practice is fragile: unsupervised representations generally encode mixtures of the target construct ($C$) and nuisance attributes ($Z$) such as topic, style, venue, time, and authorship. In this setting, raw similarity metrics are not guaranteed to isolate $C$ and can instead track pipeline-induced variation.

To move from plausible proxies to cumulative measurement, we propose the Construct Validity Protocol (CVP) as a community standard. The CVP operationalizes a full pipeline: (i) conceptualization with explicit construct boundaries and a domain of observables, (ii) face/content validation through exemplar design and expert audit, (iii) instrument design that targets construct-relevant text while controlling confounds, and (iv) a Validity Suite that reports reliability/stability, convergent evidence against independent labels, discriminant and *incremental* evidence beyond topic/style controls, known-groups separation, and predictive/criterion tests with falsification outcomes. Within this framework, we introduced Counterfactual Neutralization (using LLM-generated counterfactual rewrites to hold $Z$ fixed while varying construct-relevant content) and complementary tools such as orthogonal projection when appropriate.

Our aim is not to prohibit proxies, but to make them *testable*. Accuracy is about hitting the target; construct validity is about ensuring the target is the right one. Embedding-based “social variables” should therefore be accompanied by transparent validity evidence: a compact Validity Card that documents design choices, stability checks, and the full set of diagnostics. Adopting such standards would make results comparable across papers, reduce the risk of topic/style leakage masquerading as social signal, and ultimately enable NLP-based measurement to serve as credible scientific evidence rather than convenient geometry.

### Limitations

First, our coding reflects what is *reported* in papers rather than what authors may have performed but did not document. This is a feature, because cumulative science depends on transparent validity arguments, but it also means we may undercount validity evidence that exists only in unpublished analyses, code repositories, or informal checks. Moreover, our coarse {Yes/Partial/No} labels compress a spectrum of practices, and several dimensions (especially predictive/criterion validity and confound isolation) depend on definitional choices about what constitutes a sufficiently external criterion or an identification strategy. Future work could improve reliability by preregistering the rubric, double-coding with independent raters, and reporting inter-rater agreement.

Second, this paper is a position and synthesis contribution: we propose the Construct Validity Protocol \(CVP\) as a standard for measurement identification, but we do not instantiate the full protocol end\-to\-end on a new dataset\. Empirical case studies that apply the CVP prospectively—including controlled discriminant tests and confound\-neutralization designs—are needed to quantify the practical costs, failure modes, and benefits of the protocol in real measurement pipelines, especially as LLM\-based encoders and prompting practices continue to evolve rapidly\.

## References

- Adcock and Collier \(2001\)Robert Adcock and David Collier\. 2001\.Measurement validity: A shared standard for qualitative and quantitative research\.*American political science review*, 95\(3\):529–546\.
- American Educational Research Association et al\. \(2014\)American Educational Research Association, American Psychological Association, and National Council on Measurement in Education\. 2014\.*Standards for Educational and Psychological Testing*\.American Educational Research Association\.
- Aroyo and Welty \(2015\)Lora Aroyo and Chris Welty\. 2015\.Truth is a lie: Crowd truth and the seven myths of human annotation\.*AI magazine*, 36\(1\):15–24\.
- Azizov et al\. \(2024\)Dilshod Azizov, Zain Muhammad Mujahid, Hilal AlQuabeh, Preslav Nakov, and Shangsong Liang\. 2024\.SAFARI: Cross\-lingual bias and factuality detection in news media and news articles\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 12217–12231\. Association for Computational Linguistics\.
- Azzopardi and Moshfeghi \(2025\)Leif Azzopardi and Yashar Moshfeghi\. 2025\.POW: Political overton windows of large language models\.In*Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 24767–24773\. Association for Computational Linguistics\.
- Baly et al\. \(2020\)Ramy Baly, Giovanni Da San Martino, James Glass, and Preslav Nakov\. 2020\.We can detect your bias: Predicting the political ideology of news articles\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*, pages 4982–4991\.
- Bang et al\. \(2024\)Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung\. 2024\.Measuring political bias in large language models: What is said and how it is said\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 11142–11159\.
- Bender and Friedman \(2018\)Emily M Bender and Batya Friedman\. 2018\.Data statements for natural language processing: Toward mitigating system bias and enabling better science\.*Transactions of the Association for Computational Linguistics*, 6:587–604\.
- Bengio et al\. \(2013\)Yoshua Bengio, Aaron Courville, and Pascal Vincent\. 2013\.Representation learning: A review and new perspectives\.*IEEE transactions on pattern analysis and machine intelligence*, 35\(8\):1798–1828\.
- Blodgett et al\. \(2020\)Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach\. 2020\.Language \(technology\) is power: A critical survey of “bias” in NLP\.In*Proceedings of the 58th annual meeting of the association for computational linguistics*, pages 5454–5476\.
- Bolukbasi et al\. \(2016\)Tolga Bolukbasi, Kai\-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai\. 2016\.Man is to computer programmer as woman is to homemaker? debiasing word embeddings\.*Advances in Neural Information Processing Systems*, 29\.
- Caliskan et al\. \(2017\)Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan\. 2017\.Semantics derived automatically from language corpora contain human\-like biases\.*Science*, 356\(6334\):183–186\.
- Campbell and Fiske \(1959\)Donald T Campbell and Donald W Fiske\. 1959\.Convergent and discriminant validation by the multitrait\-multimethod matrix\.*Psychological bulletin*, 56\(2\):81\.
- Cohen \(1988\)Jacob Cohen\. 1988\.*Statistical Power Analysis for the Behavioral Sciences*\.Lawrence Erlbaum Associates\.
- Cronbach and Meehl \(1955\)Lee J Cronbach and Paul E Meehl\. 1955\.Construct validity in psychological tests\.*Psychological bulletin*, 52\(4\):281\.
- Davani et al\. \(2022\)Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran\. 2022\.Dealing with disagreements: Looking beyond the majority vote in subjective annotations\.*Transactions of the Association for Computational Linguistics*, 10:92–110\.
- Demszky et al\. \(2021\)Dorottya Demszky, Jing Liu, Zid Mancenido, Julie Cohen, Heather Hill, Dan Jurafsky, and Tatsunori B Hashimoto\. 2021\.Measuring conversational uptake: A case study on student\-teacher interactions\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 1638–1653\.
- Demszky et al\. \(2020\)Dorottya Demszky, Dana Movshovitz\-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi\. 2020\.Goemotions: A dataset of fine\-grained emotions\.In*Proceedings of the 58th annual meeting of the association for computational linguistics*, pages 4040–4054\.
- DeVellis \(2016\)Robert F\. DeVellis\. 2016\.*Scale Development: Theory and Applications*\.Sage Publications\.
- Egami et al\. \(2022\)Naoki Egami, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart\. 2022\.How to make causal inferences using texts\.*Science Advances*, 8\(42\):eabg2652\.
- Egami et al\. \(2023\)Naoki Egami, Musashi Hinck, Brandon Stewart, and Hanying Wei\. 2023\.Using imperfect surrogates for downstream inference: Design\-based supervised learning for social science applications of large language models\.*Advances in Neural Information Processing Systems*, 36:68589–68601\.
- Elazar and Goldberg \(2018\)Yanai Elazar and Yoav Goldberg\. 2018\.Adversarial removal of demographic attributes from text data\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 11–21\.
- ElSherief et al\. \(2021\)Mai ElSherief, Caleb Ziems, David Muchlinski, Vaishnavi Anupindi, Jordyn Seybolt, Munmun De Choudhury, and Diyi Yang\. 2021\.Latent hatred: A benchmark for understanding implicit hate speech\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 345–363\.
- Faulborn et al\. \(2025\)Mats Faulborn, Indira Sen, Max Pellert, Andreas Spitz, and David Garcia\. 2025\.Only a little to the left: A theory\-grounded measure of political bias in large language models\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 31684–31704\.
- Feng et al\. \(2023\)Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov\. 2023\.From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair nlp models\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 11737–11762\.
- Gabriel et al\. \(2022\)Saadia Gabriel, Skyler Hallinan, Maarten Sap, Pemi Nguyen, Franziska Roesner, Eunsol Choi, and Yejin Choi\. 2022\.Misinfo reaction frames: Reasoning about readers’ reactions to news headlines\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 3108–3127\.
- Gardner et al\. \(2020\)Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, and 1 others\. 2020\.Evaluating models’ local decision boundaries via contrast sets\.In*Findings of the Association for Computational Linguistics: EMNLP 2020*, pages 1307–1323\.
- Garg et al\. \(2018\)Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou\. 2018\.Word embeddings quantify 100 years of gender and ethnic stereotypes\.*Proceedings of the National Academy of Sciences*, 115\(16\):E3635–E3644\.
- Geiger et al\. \(2020\)R Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, and Jenny Huang\. 2020\.Garbage in, garbage out? do machine learning application papers in social computing report where human\-labeled training data comes from?In*Proceedings of the 2020 conference on fairness, accountability, and transparency*, pages 325–336\.
- Gentzkow et al\. \(2019\)Matthew Gentzkow, Bryan Kelly, and Matt Taddy\. 2019\.Text as data\.*Journal of Economic Literature*, 57\(3\):535–574\.
- Grimmer and Stewart \(2013\)Justin Grimmer and Brandon M Stewart\. 2013\.Text as data: The promise and pitfalls of automatic content analysis methods for political texts\.*Political analysis*, 21\(3\):267–297\.
- Gururangan et al\. \(2018\)Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith\. 2018\.Annotation artifacts in natural language inference data\.In*Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\)*, pages 107–112\.
- Harel\-Canada et al\. \(2024\)Fabrice Y Harel\-Canada, Hanyu Zhou, Sreya Muppalla, Zeynep Senahan Yildiz, Miryung Kim, Amit Sahai, and Nanyun Peng\. 2024\.Measuring psychological depth in language models\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 17162–17196\.
- Hastie et al\. \(2009\)Trevor Hastie, Robert Tibshirani, Jerome Friedman, and 1 others\. 2009\.The elements of statistical learning\.
- Haynes et al\. \(1995\)Stephen N Haynes, David Richard, and Edward S Kubany\. 1995\.Content validity in psychological assessment: A functional approach to concepts and methods\.*Psychological assessment*, 7\(3\):238\.
- Hoover et al\. \(2020\)Joe Hoover, Gwenyth Portillo\-Wightman, Leigh Yeh, Shreya Havaldar, Aida Mostafazadeh Davani, Ying Lin, Brendan Kennedy, Mohammad Atari, Zahra Kamel, Madelyn Mendlen, and 1 others\. 2020\.Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment\.*Social Psychological and Personality Science*, 11\(8\):1057–1071\.
- Huang and Yang \(2023\)Jing Huang and Diyi Yang\. 2023\.Culturally aware natural language inference\.In*Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 7591–7609\.
- Hunsley and Meyer \(2003\)John Hunsley and Gregory J Meyer\. 2003\.The incremental validity of psychological testing and assessment: conceptual, methodological, and statistical issues\.*Psychological assessment*, 15\(4\):446\.
- Ils et al\. \(2021\)Alexandra Ils, Dan Liu, Daniela Grunow, and Steffen Eger\. 2021\.Changes in european solidarity before and during covid\-19: Evidence from a large crowd\-and expert\-annotated twitter dataset\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 1623–1637\.
- Kelley \(1927\)Truman Lee Kelley\. 1927\.*Interpretation of educational measurements*\.World Book Company\.
- Knox et al\. \(2022\)Dean Knox, Christopher Lucas, and Wendy K Tam Cho\. 2022\.Testing causal theories with learned proxies\.*Annual Review of Political Science*, 25:419–441\.
- Koo and Li \(2016\)Terry K Koo and Mae Y Li\. 2016\.A guideline of selecting and reporting intraclass correlation coefficients for reliability research\.*Journal of chiropractic medicine*, 15\(2\):155–163\.
- Kozlowski et al\. \(2019\)Austin C Kozlowski, Matt Taddy, and James A Evans\. 2019\.The geometry of culture: Analyzing the meanings of class through word embeddings\.*American Sociological Review*, 84\(5\):905–949\.
- Lawshe \(1975\)Charles H Lawshe\. 1975\.A quantitative approach to content validity\.*Personnel psychology*, 28\(4\)\.
- Lee et al\. \(2024\)Junha Lee, Jaeshin Cho, Youngjin Cho, Hyewon Jin, Hyemin Lee, and Min Song\. 2024\.[ICScore: Metrics for evaluating interestingness and creativity of stories](https://neurips.cc/virtual/2024/107877)\.Poster, NeurIPS 2024 Workshop: Statistical Frontiers in LLMs and Foundation Models\.
- Lipsitch et al\. \(2010\)Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen\. 2010\.Negative controls: a tool for detecting confounding and bias in observational studies\.*Epidemiology*, 21\(3\):383–388\.
- Liu et al\. \(2022\)Yujian Liu, Xinliang Frederick Zhang, David Wegsman, Nicholas Beauchamp, and Lu Wang\. 2022\.Politics: Pretraining with same\-story article comparison for ideology prediction and stance detection\.In*Findings of the Association for Computational Linguistics: NAACL 2022*, pages 1354–1374\.
- Locatello et al\. \(2019\)Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem\. 2019\.Challenging common assumptions in the unsupervised learning of disentangled representations\.In*international conference on machine learning*, pages 4114–4124\. PMLR\.
- May et al\. \(2019\)Chandler May, Alex Wang, Shikha Bordia, Samuel Bowman, and Rachel Rudinger\. 2019\.On measuring social biases in sentence encoders\.In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\)*, pages 622–628\.
- McCoy et al\. \(2019\)R Thomas McCoy, Ellie Pavlick, and Tal Linzen\. 2019\.Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference\.In*Proceedings of the 57th annual meeting of the association for computational linguistics*, pages 3428–3448\.
- Merrill et al\. \(2024\)William Merrill, Noah A Smith, and Yanai Elazar\. 2024\.Evaluating n\-gram novelty of language models using rusty\-dawg\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 14459–14473\.
- Messick \(1995\)Samuel Messick\. 1995\.Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning\.*American psychologist*, 50\(9\):741\.
- Nunnally and Bernstein \(1994\)J\.C\. Nunnally and I\.H\. Bernstein\. 1994\.[*Psychometric Theory*](https://books.google.com.sg/books?id=r0fuAAAAMAAJ)\.Number no\. 972 in McGraw\-Hill series in psychology\. McGraw\-Hill Companies, Incorporated\.
- Pavlick and Kwiatkowski \(2019\)Ellie Pavlick and Tom Kwiatkowski\. 2019\.Inherent disagreements in human textual inferences\.*Transactions of the Association for Computational Linguistics*, 7:677–694\.
- Plank \(2022\)Barbara Plank\. 2022\.The “problem” of human label variation: On ground truth in data, modeling and evaluation\.In*Proceedings of the 2022 conference on Empirical Methods in Natural Language Processing*, pages 10671–10682\.
- Poliak et al\. \(2018\)Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme\. 2018\.Hypothesis only baselines in natural language inference\.In*Proceedings of the seventh joint conference on lexical and computational semantics*, pages 180–191\.
- Ravfogel et al\. \(2020\)Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg\. 2020\.Null it out: Guarding protected attributes by iterative nullspace projection\.In*Proceedings of the 58th annual meeting of the association for computational linguistics*, pages 7237–7256\.
- Reimers and Gurevych \(2019\)Nils Reimers and Iryna Gurevych\. 2019\.Sentence\-bert: Sentence embeddings using siamese bert\-networks\.In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 3982–3992\.
- Ribeiro et al\. \(2020\)Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh\. 2020\.Beyond accuracy: Behavioral testing of nlp models with checklist\.In*Proceedings of the 58th annual meeting of the association for computational linguistics*, pages 4902–4912\.
- Sap et al\. \(2019\)Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith\. 2019\.The risk of racial bias in hate speech detection\.In*Proceedings of the 57th annual meeting of the Association for Computational Linguistics*, pages 1668–1678\.
- Schölkopf et al\. \(2021\)Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio\. 2021\.Toward causal representation learning\.*Proceedings of the IEEE*, 109\(5\):612–634\.
- Shrout and Fleiss \(1979\)Patrick E Shrout and Joseph L Fleiss\. 1979\.Intraclass correlations: uses in assessing rater reliability\.*Psychological bulletin*, 86\(2\):420\.
- Sinno et al\. \(2022\)Barea Sinno, Bernardo Oviedo, Katherine Atwell, Malihe Alikhani, and Junyi Jessy Li\. 2022\.Political ideology and polarization: A multi\-dimensional approach\.In*Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 231–243\.
- Sky et al\. \(2023\)CH\-Wang Sky, Arkadiy Saakyan, Oliver Li, Zhou Yu, and Smaranda Muresan\. 2023\.Sociocultural norm similarities and differences via situational alignment and explainable textual entailment\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 3548–3564\.
- Spearman \(1904\)Charles Spearman\. 1904\.The proof and measurement of association between two things\.*The American Journal of Psychology*, 15\(1\):72–101\.
- Vidgen et al\. \(2021\)Bertie Vidgen, Tristan Thrush, Zeerak Talat, and Douwe Kiela\. 2021\.Learning from the worst: Dynamically generated datasets to improve online hate detection\.In*Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing \(volume 1: long papers\)*, pages 1667–1682\.
- Vijjini et al\. \(2024\)Anvesh Rao Vijjini, Rakesh R Menon, Jiayi Fu, Shashank Srivastava, and Snigdha Chaturvedi\. 2024\.Socialgaze: Improving the integration of human social norms in large language models\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 16487–16506\.
- Zhou et al\. \(2022\)Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, and Dan Jurafsky\. 2022\.Problems with cosine as a measure of embedding similarity for high frequency words\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pages 401–423\.

## Appendix A Measurement Validity Coding Scheme and Rubric

This appendix specifies the coding rubric used in the forensic analysis \(Appendix [B](https://arxiv.org/html/2605.07409#A2)\)\. Each paper is evaluated on seven dimensions of measurement validity and identification\. For each dimension, we assign one of three labels: Yes, Partial, or No\.

### A\.1 General Scoring Labels

- Yes \(High rigor\)\. The method meets the highest social science standards\. It is grounded in established theory and/or uses rigorous psychometric validation or explicit causal designs, with clear empirical evidence\.
- Partial \(Medium rigor\)\. The method relies on operational heuristics, “silver standards,” or ad\-hoc sanity checks\. It may acknowledge threats \(e\.g\., confounding\) but addresses them with standard supervised learning, informal qualitative inspection, or design constraints rather than explicit identification\.
- No \(Low rigor\)\. The authors provide no discussion, definition, or empirical validation for the dimension\.

### A\.2 Dimensions of Validity

#### Dimension 1: Target Variable Definition \(Construct Validity\)\.

- Yes\. Cites a specific, pre\-existing social science theory \(e\.g\., Moral Foundations Theory\) and defines the construct independently of the dataset and measurement procedure \(e\.g\., “We measure ideology as conceptually defined by Converse \(1964\) …”\)\.
- Partial\. Defines the variable operationally, tautologically, or based primarily on dataset labels or a single chosen proxy \(e\.g\., “toxicity is whatever the Perspective API labels as toxic,” or “bias is cosine distance between vectors”\)\.
- No\. No clear definition is provided; the variable is treated as self\-explanatory\.

#### Dimension 2: Face Validity \(Content Validity\)\.

- Yes\. Conducts a formal content validity study prior to deployment, such as expert\-panel review of items/lexicon or a structured pilot with domain experts\.
- Partial\. Provides informal sanity checks \(e\.g\., selected high/low scoring examples, qualitative inspection of top features/words\) without a structured expert evaluation\.
- No\. No inspection of instrument content and no qualitative examples are provided\.

#### Dimension 3: Reliability / Stability\.

- Yes\. Reports formal, chance\-corrected reliability/stability metrics\. For human annotation: Cohen’s/Fleiss’ $\kappa$, ICC, or Krippendorff’s $\alpha$\. For model\-based instruments: test–retest reliability or rigorous robustness checks \(e\.g\., prompt perturbations/paraphrasing\) with appropriate agreement metrics\.
- Partial\. Reports weak or non\-corrected metrics \(e\.g\., raw % agreement without chance correction\), or evaluates stability only on a small/non\-representative subset\.
- No\. No reliability metric or stability check is reported\.
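
To illustrate why the rubric separates chance\-corrected metrics from raw percent agreement, the following is a minimal sketch of Cohen’s $\kappa$ for two raters, written from the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$\. The function name `cohen_kappa` and the toy annotations are illustrative assumptions and are not taken from any of the coded papers\.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, computed from each rater's marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy annotations, for illustration only.
a = ["hate", "hate", "none", "none", "hate", "none", "none", "none"]
b = ["hate", "none", "none", "none", "hate", "none", "hate", "none"]
raw_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
kappa = cohen_kappa(a, b)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

On this toy input the raw agreement is 0\.75 while $\kappa$ is roughly 0\.47, which is the kind of gap that motivates coding raw agreement alone as Partial\.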

#### Dimension 4: Convergent Validity\.

- Yes\. Demonstrates significant association with a completely independent, external “gold\-standard” measure of the same construct at the appropriate unit of analysis \(e\.g\., “Our ideology score correlates with DW\-NOMINATE roll\-call votes”\)\.
- Partial\. Correlates with a “silver standard,” a related\-but\-distinct proxy, or internal metadata that is suggestive but not a gold\-standard instrument \(e\.g\., star ratings, hashtags, outlet categories\)\.
- No\. No external correlation or comparison is reported\.

#### Dimension 5: Discriminant Validity\.

- Yes\. Explicitly tests and empirically demonstrates that the measure is distinct from nuisance factors \(e\.g\., near\-zero correlation with document length, explicit topic residualization, or formal tests against style/genre/demographic/prompt effects\)\.
- Partial\. Acknowledges potential confounds and/or offers qualitative arguments or design constraints, but does not provide formal tests sufficient to rule out nuisance surrogacy\.
- No\. No discussion or test of discriminant validity\.
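
As one possible reading of the “near\-zero correlation with document length” example, the sketch below computes the Pearson correlation between a construct score and each nuisance variable and flags any whose absolute correlation exceeds a cutoff\. The helper names, the cutoff `max_abs_r=0.1`, and the toy data are assumptions made for illustration; the rubric itself does not prescribe a threshold\.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def discriminant_report(scores, nuisances, max_abs_r=0.1):
    """Map each nuisance name to (r, passes), where passes means |r| <= max_abs_r.

    max_abs_r is a hypothetical cutoff chosen for this sketch.
    """
    report = {}
    for name, values in nuisances.items():
        r = pearson_r(scores, values)
        report[name] = (r, abs(r) <= max_abs_r)
    return report


# Hypothetical toy data, for illustration only.
novelty_scores = [0.2, 0.4, 0.6, 0.8]
nuisances = {
    "doc_length": [100, 200, 300, 400],  # rises in lockstep with the score
    "style_flag": [1, 0, 0, 1],          # unrelated to the score
}
report = discriminant_report(novelty_scores, nuisances)
for name, (r, passes) in report.items():
    print(f"{name}: r = {r:+.2f}, passes = {passes}")
```

Here the score fails against `doc_length` and passes against `style_flag`\. A linear correlation only rules out linear dependence, so a passing result is weaker evidence than the formal tests the Yes label asks for\.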

#### Dimension 6: Predictive Validity\.

- Yes\. Predicts a real\-world downstream outcome external to the annotation/task setting \(e\.g\., “polarization predicts future protest violence,” “trust predicts trading volume”\)\.
- Partial\. Predicts an internal proxy outcome or trivial metadata label \(e\.g\., predicting subreddit labels, publication year, or other dataset\-internal fields\)\.
- No\. No predictive/criterion task is reported\.

#### Dimension 7: Handling Confounding Covariates\.

- Yes\. Uses explicit causal inference methods to isolate the construct from confounding covariates $Z$ \(e\.g\., instrumental variables, double machine learning, propensity score matching, or rigorous residualization under stated assumptions\)\.
- Partial\. Uses heuristics to exclude/limit $Z$, or uses standard supervised learning that includes $Z$ as a feature \(associative “controls” without causal identification\)\.
- No\. No discussion of confounding covariates; treats relationships as direct without controls\.
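
The simplest form of the residualization mentioned above can be sketched as an ordinary least squares fit of the score on a single covariate $Z$, keeping the residuals\. This is only an illustration under strong assumptions (one observed covariate, a linear relationship); it is not the paper’s procedure, and the function name and toy values are hypothetical\.

```python
def residualize(scores, z):
    """Remove the linear effect of one covariate z from scores via OLS.

    Returns the residuals scores - (intercept + slope * z).
    """
    n = len(scores)
    if n != len(z) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_z, mean_s = sum(z) / n, sum(scores) / n
    szz = sum((v - mean_z) ** 2 for v in z)
    if szz == 0:
        raise ValueError("covariate is constant; nothing to residualize")
    slope = sum((v - mean_z) * (s - mean_s) for v, s in zip(z, scores)) / szz
    intercept = mean_s - slope * mean_z
    return [s - (intercept + slope * v) for s, v in zip(scores, z)]


# Hypothetical toy data: a raw score that mostly tracks a topic covariate.
topic_share = [1, 2, 3, 4, 5]
raw_scores = [2.1, 3.9, 6.2, 7.8, 10.0]
residuals = residualize(raw_scores, topic_share)

# By construction, OLS residuals are uncorrelated with the covariate in-sample.
mean_z = sum(topic_share) / len(topic_share)
cov_with_z = sum(r * (v - mean_z) for r, v in zip(residuals, topic_share))
print([round(r, 3) for r in residuals], round(cov_with_z, 12))
```

Residuals of this kind are orthogonal to $Z$ only in the linear, in\-sample sense, which is why the rubric qualifies residualization with “under stated assumptions”\.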

## Appendix B Paper\-Level Forensic Coding Notes

This appendix reports the paper\-level coding notes underlying Table [1](https://arxiv.org/html/2605.07409#S5.T1) in the main text\. We code 17 influential ACL/NAACL/EMNLP papers \(2020–2025\) that propose computational measures of abstract social constructs using seven dimensions: \(D1\) target\-variable definition \(construct validity\), \(D2\) face/content validity, \(D3\) reliability/stability, \(D4\) convergent validity, \(D5\) discriminant validity, \(D6\) predictive/criterion validity, and \(D7\) handling confounding covariates\. Each dimension is coded as Yes, Partial, or No following the rubric in Appendix [A](https://arxiv.org/html/2605.07409#A1)\.

### B\.1 Paper\-Level Coding Notes \(D1–D7\)

#### Paper 1: [Demszky et al\.](https://arxiv.org/html/2605.07409#bib.bib18) \(ACL [2020](https://arxiv.org/html/2605.07409#bib.bib18)\), Fine\-grained Emotions \(Reddit\)\.

- D1 \(Construct\): Partial\. Label set motivated by psychology literature and a careful selection process, but the construct definition remains closely tied to the taxonomy rather than an independent theory\-grounded construct map\.
- D2 \(Face\): Partial\. Presents labeled examples \(e\.g\., example annotations\), but no formal expert content\-validity audit is reported\.
- D3 \(Reliability\): Yes\. Reports rater agreement metrics including chance\-corrected statistics \(e\.g\., Cohen’s $\kappa$\) alongside corroborating checks\.
- D4 \(Convergent\): Partial\. Shows transfer/generalization to existing emotion benchmarks; useful triangulation but not a classic external gold\-standard convergence test\.
- D5 \(Discriminant\): No\. No explicit tests that emotion labels are distinct from nuisance factors \(topic/style/length\)\.
- D6 \(Predictive\): No\. Validations focus on label prediction rather than external real\-world criteria\.
- D7 \(Confounding\): Partial\. Identifies confounds and applies curation measures, but no causal identification strategy is used\.

#### Paper 2: [Baly et al\.](https://arxiv.org/html/2605.07409#bib.bib6) \(EMNLP [2020](https://arxiv.org/html/2605.07409#bib.bib6)\), Political Bias and Factuality \(Outlets \+ Social Profiles\)\.

- D1 \(Construct\): Partial\. Targets defined primarily through MBFC label categories rather than an independent theory definition\.
- D2 \(Face\): Partial\. Informal sanity checks \(e\.g\., excluding ill\-defined label categories\), but no structured content\-validity study\.
- D3 \(Reliability\): No\. Reliability of the inherited labels is not quantified in the paper\.
- D4 \(Convergent\): No\. Evaluations mainly predict the same label scheme; no external convergence at the same unit of analysis\.
- D5 \(Discriminant\): No\. Feature ablations are not framed as discriminant tests against nuisances\.
- D6 \(Predictive\): No\. No external outcome prediction beyond the dataset labels\.
- D7 \(Confounding\): Partial\. Practical heuristics \(filters/exclusions/multi\-source signals\), but no causal identification methods\.

#### Paper 3: [Ils et al\.](https://arxiv.org/html/2605.07409#bib.bib39) \(ACL [2021](https://arxiv.org/html/2605.07409#bib.bib39)\), Social Solidarity / Anti\-Solidarity \(Twitter\)\.

- D1 \(Construct\): Yes\. Provides an explicit, theory\-grounded definition of solidarity\.
- D2 \(Face\): Yes\. Involves social\-science experts in refinement/adjudication\.
- D3 \(Reliability\): Yes\. Reports Cohen’s $\kappa$ and compares agreement across conditions\.
- D4 \(Convergent\): No\. No convergence test against an external solidarity index/instrument\.
- D5 \(Discriminant\): Partial\. Reports limited discriminant evidence \(e\.g\., low correlation with sentiment\), but no systematic nuisance\-factor testing\.
- D6 \(Predictive\): Partial\. Correlates the construct with external indicators \(e\.g\., COVID\-19 rates\) at aggregate level; suggestive but not a dedicated predictive validation\.
- D7 \(Confounding\): Partial\. Notes sampling and causal ambiguity; no identification strategy is implemented\.

#### Paper 4: [Demszky et al\.](https://arxiv.org/html/2605.07409#bib.bib17) \(ACL [2021](https://arxiv.org/html/2605.07409#bib.bib17)\), Conversational Uptake \(Student–Teacher Transcripts\)\.

- D1 \(Construct\): Yes\. Defines uptake as a linguistic/social construct and provides an operational definition\.
- D2 \(Face\): Partial\. Offers qualitative plausibility checks and examples but no formal content\-validity study\.
- D3 \(Reliability\): Yes\. Reports structured annotation and inter\-rater agreement with explicit statistics; describes aggregation\.
- D4 \(Convergent\): Yes\. Tests alignment with indicators expected to track uptake\.
- D5 \(Discriminant\): Partial\. Shows uptake is not reducible to repetition/overlap, but does not fully rule out broader nuisances\.
- D6 \(Predictive\): Yes\. Links uptake to downstream outcomes relevant to teaching contexts \(e\.g\., satisfaction/quality\)\.
- D7 \(Confounding\): Partial\. Addresses some topical overlap concerns but does not implement causal identification\.

#### Paper 5: [Vidgen et al\.](https://arxiv.org/html/2605.07409#bib.bib66) \(ACL [2021](https://arxiv.org/html/2605.07409#bib.bib66)\), Online Hate \(Adversarial Dataset\)\.

- D1 \(Construct\): Yes\. Provides an explicit definition of hate for the annotation task\.
- D2 \(Face\): Yes\. Emphasizes expert annotators and presents credibility signals for labeling\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected agreement \(e\.g\., Krippendorff’s $\alpha$\) across rounds\.
- D4 \(Convergent\): Partial\. Evaluates against an external functional test suite \(triangulation rather than gold\-standard correlation\)\.
- D5 \(Discriminant\): Partial\. Uses hard negatives/contrastive design to reduce keyword confounds, but does not provide formal discriminant statistics\.
- D6 \(Predictive\): No\. Focuses on dataset/model evaluation, not external real\-world outcomes\.
- D7 \(Confounding\): Partial\. Mitigates confounds via data design, but no causal identification strategy is used\.

#### Paper 6: [ElSherief et al\.](https://arxiv.org/html/2605.07409#bib.bib23) \(EMNLP [2021](https://arxiv.org/html/2605.07409#bib.bib23)\), Latent Hatred / Implicit Hate \(Twitter\)\.

- D1 \(Construct\): Yes\. Develops a typology targeting implicit hate beyond explicit slurs\.
- D2 \(Face\): Partial\. Taxonomy and examples support plausibility, but no formal expert panel content\-validity study is reported\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected reliability \(e\.g\., ICC, Fleiss’ $\kappa$\)\.
- D4 \(Convergent\): Partial\. Compares against widely used systems; informative but not an external gold\-standard construct instrument\.
- D5 \(Discriminant\): Partial\. Distinguishes implicit from explicit hate; limited evidence against adjacent nuisances \(offensiveness/negativity\)\.
- D6 \(Predictive\): No\. Evaluations are mainly benchmarking/model performance\.
- D7 \(Confounding\): Partial\. Uses heuristics \(e\.g\., filtering/keyword\-related controls\) to focus on the implicit construct, but not causal identification\.

#### Paper 7: [Liu et al\.](https://arxiv.org/html/2605.07409#bib.bib47) \(NAACL [2022](https://arxiv.org/html/2605.07409#bib.bib47)\), Ideology and Stance \(News; POLITICS pretraining\)\.

- D1 \(Construct\): Partial\. Ideology operationalized via outlet\-level labels rather than a theory\-grounded definition\.
- D2 \(Face\): Partial\. Informal plausibility checks \(e\.g\., attention visualization\), but no structured content\-validity procedure\.
- D3 \(Reliability\): No\. No reliability metrics reported for outlet\-level ideology labels\.
- D4 \(Convergent\): Partial\. Uses third\-party outlet ratings and tests downstream utility, but no independent political\-science ideology measure at the same unit\.
- D5 \(Discriminant\): Partial\. Limited topic\-control evidence; no systematic nuisance testing\.
- D6 \(Predictive\): No\. Focuses on ideology prediction/benchmarks rather than external outcomes\.
- D7 \(Confounding\): Partial\. Uses heuristics to reduce imbalance/bias; no causal identification methods\.

#### Paper 8: [Sinno et al\.](https://arxiv.org/html/2605.07409#bib.bib63) \(NAACL [2022](https://arxiv.org/html/2605.07409#bib.bib63)\), Multi\-dimensional Political Ideology \(News Paragraphs\)\.

- D1 \(Construct\): Yes\. Political ideology defined with political\-science grounding and operationalized across dimensions\.
- D2 \(Face\): Yes\. Provides interpretable examples and expert annotation credibility signals\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected agreement \(Krippendorff’s $\alpha$\)\.
- D4 \(Convergent\): Partial\. Triangulates with outlet\-bias signals; not a direct external instrument correlation\.
- D5 \(Discriminant\): Partial\. Separates ideology from stance by design; no formal nuisance\-confound tests beyond constraints\.
- D6 \(Predictive\): No\. Evaluations are classification/label prediction\.
- D7 \(Confounding\): Partial\. Uses design/annotation constraints \(e\.g\., controlling for stance\), but no verified causal isolation\.

#### Paper 9: [Gabriel et al\.](https://arxiv.org/html/2605.07409#bib.bib26) \(ACL [2022](https://arxiv.org/html/2605.07409#bib.bib26)\), Misinfo Reaction Frames \(Headlines\)\.

- D1 \(Construct\): Partial\. Defines dimensions operationally; construct remains task\-specific rather than externally standardized\.
- D2 \(Face\): Partial\. Provides examples/descriptions; crowd annotations without expert\-panel content validity\.
- D3 \(Reliability\): Partial\. Reports reliability for categorical judgment; limited/no parallel reliability reporting for free\-text dimensions\.
- D4 \(Convergent\): No\. No validation against external behavioral ground truth of reactions\.
- D5 \(Discriminant\): No\. No explicit tests distinguishing the construct from nuisances\.
- D6 \(Predictive\): Partial\. Shows systematic shifts in trust ratings in a controlled setting; not real\-world behavioral criterion validation\.
- D7 \(Confounding\): Partial\. Design mitigations exist, but no causal confound\-handling pipeline\.

#### Paper 10: [Hoover et al\.](https://arxiv.org/html/2605.07409#bib.bib36) \([2020](https://arxiv.org/html/2605.07409#bib.bib36)\), Moral Sentiment \(MFT; Tweets\)\.

- D1 \(Construct\): Yes\. Categories grounded in Moral Foundations Theory\.
- D2 \(Face\): Yes\. Provides structured guidelines and annotator training procedures; early disagreement handling\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected agreement \(e\.g\., Fleiss’ $\kappa$, PABAK\)\.
- D4 \(Convergent\): Partial\. No correlation with an independent external MFT instrument; convergence mostly indirect\.
- D5 \(Discriminant\): Partial\. Conceptual separation of categories without formal nuisance\-factor tests\.
- D6 \(Predictive\): No\. Benchmarks classifiers; no external real\-world criterion validation\.
- D7 \(Confounding\): No\. No confound\-isolation or causal identification methods\.

#### Paper 11: [Feng et al\.](https://arxiv.org/html/2605.07409#bib.bib25) \(ACL [2023](https://arxiv.org/html/2605.07409#bib.bib25)\), Political Bias from Partisan Pretraining \(PCT axes\)\.

- D1 \(Construct\): Partial\. Operationalizes leaning via PCT outputs; construct remains tied to instrument choice rather than independent definition\.
- D2 \(Face\): Partial\. Qualitative plausibility checks via examples; no formal expert content\-validity study\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected agreement for a stance detector and includes prompt\-robustness analyses\.
- D4 \(Convergent\): Partial\. Compares against media\-bias ratings of pretraining sources \(external reference point\)\.
- D5 \(Discriminant\): No\. No explicit tests separating ideology from superficial keyword/prompt effects\.
- D6 \(Predictive\): Partial\. Links pretraining bias to downstream fairness shifts on benchmark tasks; still not an external real\-world criterion\.
- D7 \(Confounding\): Partial\. Uses controlled model\-building \(e\.g\., comparable corpora sizes\) to isolate the pretraining factor; not causal identification\.

#### Paper 12: [Sky et al\.](https://arxiv.org/html/2605.07409#bib.bib64) \(EMNLP [2023](https://arxiv.org/html/2605.07409#bib.bib64)\), Social Norms \(Cross\-cultural\)\.

- D1 \(Construct\): Yes\. Defines norms explicitly and situates them in cross\-cultural/descriptive\-norm framing\.
- D2 \(Face\): Yes\. Expert verification/editing plus concrete examples\.
- D3 \(Reliability\): Partial\. Some chance\-corrected agreement is reported, but not as a core\-label inter\-annotator reliability statistic\.
- D4 \(Convergent\): Partial\. Aligns with established cross\-cultural theory patterns; no independent gold\-standard norm instrument correlation\.
- D5 \(Discriminant\): No\. No explicit empirical separation from nuisance factors\.
- D6 \(Predictive\): No\. No external outcome prediction beyond the dataset/task\.
- D7 \(Confounding\): No\. No causal/confound\-isolation methods\.

#### Paper 13: [Bang et al\.](https://arxiv.org/html/2605.07409#bib.bib7) \(ACL [2024](https://arxiv.org/html/2605.07409#bib.bib7)\), Political Bias as Stance \+ Framing/Style \(LLM generations\)\.

- D1 \(Construct\): Partial\. Bias defined operationally as stance plus framing/style rather than a theory\-grounded construct definition\.
- D2 \(Face\): Partial\. Qualitative examples/sanity checks; no formal content\-validity study\.
- D3 \(Reliability\): Partial\. Uses repeated generations and significance testing, but does not report chance\-corrected reliability of the measurement procedure\.
- D4 \(Convergent\): Partial\. Compares against prompted anchor distributions rather than an independent external instrument\.
- D5 \(Discriminant\): Partial\. Separates content from lexical polarity but lacks formal nuisance\-confound tests\.
- D6 \(Predictive\): No\. Descriptive audit; no external criterion prediction\.
- D7 \(Confounding\): Partial\. Standardizes prompts/topics and uses anchor distributions; not causal identification\.

#### Paper 14: [Harel\-Canada et al\.](https://arxiv.org/html/2605.07409#bib.bib33) \(EMNLP [2024](https://arxiv.org/html/2605.07409#bib.bib33)\), Psychological Depth Scale \(Stories\)\.

- D1 \(Construct\): Yes\. Construct grounded in literary/reader\-response theory and defined independently of the dataset\.
- D2 \(Face\): Partial\. Guidelines/training/calibration, but no formal expert content\-validity audit\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected inter\-rater reliability for human ratings \(e\.g\., Krippendorff’s $\alpha$\)\.
- D4 \(Convergent\): Partial\. LLM\-judge scores correlate with human judgments, but not with an independent external gold\-standard instrument\.
- D5 \(Discriminant\): Partial\. Argues beyond surface style; lacks formal nuisance\-confound tests\.
- D6 \(Predictive\): No\. Used as evaluation rubric; no external real\-world outcome prediction\.
- D7 \(Confounding\): Partial\. Uses design controls; no causal identification\.

#### Paper 15: [Azizov et al\.](https://arxiv.org/html/2605.07409#bib.bib4) \(EMNLP Findings [2024](https://arxiv.org/html/2605.07409#bib.bib4)\), Political Bias and Factuality \(Cross\-lingual; xMP\)\.

- D1 \(Construct\): Partial\. Targets operationalized via existing rating schemes rather than theory\-defined constructs\.
- D2 \(Face\): Partial\. Relies on expert\-provided outlet\-level ratings and supplemental labels; no formal content\-validity study for the resulting instrument\.
- D3 \(Reliability\): Partial\. Reports cross\-check alignment between article samples and outlet labels; informative but not a full stability analysis\.
- D4 \(Convergent\): No\. No clear external convergent validation reported\.
- D5 \(Discriminant\): No\. No explicit tests against nuisance confounds\.
- D6 \(Predictive\): No\. Benchmarks models to predict labels; no external outcome criterion\.
- D7 \(Confounding\): No\. No causal/confound\-isolation methods\.

#### Paper 16: [Faulborn et al\.](https://arxiv.org/html/2605.07409#bib.bib24) \(ACL [2025](https://arxiv.org/html/2605.07409#bib.bib24)\), Political Values Measurement \(WVS/EVS items\)\.

- D1 \(Construct\): Yes\. Builds measurement around validated survey instruments and argues against ad\-hoc quizzes\.
- D2 \(Face\): Yes\. Inherits content validity from decades of survey\-item vetting\.
- D3 \(Reliability\): Yes\. Tests prompt sensitivity and demonstrates instability; reported as a stability diagnostic\.
- D4 \(Convergent\): Partial\. Motivates survey validity and population alignment, but does not provide a clean correlation\-style convergent test as coded\.
- D5 \(Discriminant\): Partial\. Diagnoses wording/prefix contamination \(nuisance sensitivity\), but does not report a classic discriminant test against pre\-specified nuisance factors\.
- D6 \(Predictive\): No\. Focus is measurement validity rather than external outcome prediction using the derived score\.
- D7 \(Confounding\): Partial\. Uses design\-based mitigation; not causal identification\.

#### Paper 17: [Azzopardi and Moshfeghi](https://arxiv.org/html/2605.07409#bib.bib5) \(EMNLP Findings [2025](https://arxiv.org/html/2605.07409#bib.bib5)\), Political Overton Window \(PRISM audit\)\.

- D1 \(Construct\): Yes\. Operationalizes Overton Window by mapping espouse/neutral/refusal across a spectrum\.
- D2 \(Face\): Partial\. Visualizations provide plausibility, but no formal expert content\-validity study\.
- D3 \(Reliability\): Yes\. Reports chance\-corrected agreement for the assessment procedure\.
- D4 \(Convergent\): Partial\. Interprets boundaries relative to expected alignment behavior; no independent ideology instrument correlation\.
- D5 \(Discriminant\): Partial\. Distinguishes position from acceptability conceptually; limited empirical nuisance\-separation evidence\.
- D6 \(Predictive\): No\. Descriptive audit; no external criterion prediction\.
- D7 \(Confounding\): Partial\. Persona probing and demographic/prompt heuristics; no causal identification methods\.
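
For readers who want to work with these codes programmatically, the sketch below shows one way to store the D1–D7 labels and tally them per dimension\. Only the labels for Papers 1–3 are transcribed from the notes above; the data layout and the function name `tally_by_dimension` are illustrative assumptions, and the remaining papers would need to be added in the same way\.

```python
from collections import Counter

DIMENSIONS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

# Labels transcribed from the coding notes for Papers 1-3 (order: D1..D7).
codes = {
    "Paper 1": ["Partial", "Partial", "Yes", "Partial", "No", "No", "Partial"],
    "Paper 2": ["Partial", "Partial", "No", "No", "No", "No", "Partial"],
    "Paper 3": ["Yes", "Yes", "Yes", "No", "Partial", "Partial", "Partial"],
}


def tally_by_dimension(paper_codes):
    """Count Yes/Partial/No labels for each dimension across papers."""
    tallies = {dim: Counter() for dim in DIMENSIONS}
    for paper, labels in paper_codes.items():
        if len(labels) != len(DIMENSIONS):
            raise ValueError(f"{paper} needs exactly {len(DIMENSIONS)} labels")
        for dim, label in zip(DIMENSIONS, labels):
            if label not in ("Yes", "Partial", "No"):
                raise ValueError(f"{paper} has an unknown label: {label}")
            tallies[dim][label] += 1
    return tallies


tallies = tally_by_dimension(codes)
for dim in DIMENSIONS:
    counts = tallies[dim]
    print(dim, {label: counts[label] for label in ("Yes", "Partial", "No")})
```

Because the tally covers only three of the 17 papers, its counts are not comparable to the totals reported in the main text\.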
