Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs

arXiv cs.CL 06/30/26, 04:00 AM Papers
llm construct-validity measurement-instruments text-coding grain-calibration social-science
Summary
This paper examines the gap between reliability and construct validity when using LLMs as coding instruments for theoretical constructs, and proposes grain calibration as a method to decompose constructs into clause-level components for more valid measurement.
arXiv:2606.28574v1 Announce Type: new
# Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs
Source: [https://arxiv.org/html/2606.28574](https://arxiv.org/html/2606.28574)
Manuel Pita Artificial Intelligence, Social Interaction and Complexity Laboratory CICANT, Universidade Lusófona 376, Campo Grande, 1700\-097 Lisbon, Portugal manuel\.pita@ulusofona\.pt

###### Abstract

When a large language model \(LLM\) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder\. Yet reliability leaves construct validity untouched\. The instrument may be theory\-naive, reaching the code through a correlate that meets none of the demands the construct’s theory makes, and no current method tells that apart from genuine measurement\. We propose grain calibration as a method that closes the gap\. It decomposes a construct into clause\-level components, tests each against the text with extractive evidence, and combines the results through an explicit, theory\-derived rule\. Because the rule is stated rather than lodged in one opaque pass, its structure is evidence about the process rather than the output\. It shows which components settled a code, and, when the code is wrong, whether a component was missed or an adjacent construct mistaken for it\. Validation shifts from scoring an instrument’s outputs against an annotator to showing that the instrument runs on the construct its theory specifies\.

*Keywords*: construct validity · grain calibration · large language models · measurement theory · text coding · theory\-naive instruments

## 1 LLMs as coding instruments

One of the instruments that turns text into data across the social and behavioural sciences is quietly changing hands\. Measuring a construct in text \(the morality in a film review, stance in a reply to a group chat, or aspirations in an interview\) traditionally meant training human coders on the inferences the construct requires\. Many researchers are turning to large language models \(LLMs\) instead\. LLMs often agree with the human coders, sometimes as closely as expert coders agree with one another\. But there is a consequential paradox in the emerging picture of LLMs as coding instruments: they can agree with humans without actually ‘measuring’ the construct’s theoretical demands\. This paper maps how *agreement* came to stand in for *validity* when LLMs are used as coding instruments, and proposes grain calibration, an iterative procedure that tunes the instrument until its codes meet the demands of the construct’s theory\.

Whether LLMs can ‘reason’ is contested\. A recent synthesis finds it premature to decide whether they hold structured internal representations\[[49](https://arxiv.org/html/2606.28574#bib.bib101)\], and work on capability evaluation warns against inferring such capacities from task performance alone\[[46](https://arxiv.org/html/2606.28574#bib.bib113)\]\. We stay deliberately agnostic on the matter\. What follows asks not what an LLM is internally, but what it does as a coding instrument\. If we want to know what an LLM is doing, one way is to study it as we would any instrument whose internal mechanisms are sealed\. This often means designing specific inputs, feeding them to the coding instrument, and analysing what it returns\.

But why study an instrument that works? The success record of LLMs as reliable coders is substantial by now\. Handed tweets and a two\-sentence prompt, GPT is more self\-consistent across coding runs than trained coders are with each other, and outperforms crowd workers by 25%\[[30](https://arxiv.org/html/2606.28574#bib.bib12)\]\. Asked to classify political affiliation, GPT\-4 reaches 93% accuracy and beats expert coders, picking up oblique cues a supervised classifier misses, from Bible quotations read as conservative signalling to Down syndrome awareness read as pro\-life Republican positioning\[[52](https://arxiv.org/html/2606.28574#bib.bib13)\]\. On standard coding tasks, such as stance and hate\-speech classification, zero\-shot LLMs reach fair\-to\-substantial agreement with human coders\[[58](https://arxiv.org/html/2606.28574#bib.bib15)\]\. Similarly, in interpretive qualitative coding, a capable LLM can achieve human\-level reliability, with chain\-of\-thought prompting further refining the task\[[23](https://arxiv.org/html/2606.28574#bib.bib41)\]\. For sentiment polarity, GPT shows competitive performance, with correlations with human coders as high as $r = 0.77$, while for coding discrete emotions, it achieves F1 scores of around 0\.72 to 0\.78\[[50](https://arxiv.org/html/2606.28574#bib.bib14)\]\. Judging by the test the field actually applies \(agreement with trained human coders\), the LLM instrument *works*\.

What can we learn about the workings of the \(sealed\) LLM coding instrument? First, the studies above indicate that LLMs rely on variables present in the input text: word patterns matching a topic distribution, sentiment valence/arousal clusters, linguistic cues, and even oblique cues available through transitive word\-pattern associations\. We adopt the term *surface feature*\[[18](https://arxiv.org/html/2606.28574#bib.bib2)\] for such a variable, which the LLM literature also calls ‘form’\[[11](https://arxiv.org/html/2606.28574#bib.bib64)\]\. Surface features are not the ‘deep’ properties that must be inferred from relations among features\[[29](https://arxiv.org/html/2606.28574#bib.bib3)\]\. Second, prompt and input text enter the model together as a single sequence of words, so what the input text is coded to express becomes coupled with the instrument that performs the coding\. Third, what the LLM can measure is largely constrained by the representations and associations it learned before the coding task; prompting can steer that capacity, but does not by itself establish that the construct’s theoretical relations are being tested\[[45](https://arxiv.org/html/2606.28574#bib.bib32)\]\. These are all challenges to LLMs as coders of theoretical constructs, that is, posits with deep structure that are not directly observable but must be inferred\[[5](https://arxiv.org/html/2606.28574#bib.bib4),[40](https://arxiv.org/html/2606.28574#bib.bib5)\]\.

### 1\.1 Where the instrument fails

The successes above hide two problems in how the instrument reaches a code, and neither shows up in an agreement score\. The first is a *shortcut*: an LLM can produce a code by reading surface features that merely correlate with a construct, rather than the deeper inferences that define it\. Such a code can agree with human ground truth while measuring something else\. The second is *entanglement*: the prompt is part of the instrument, not a neutral codebook instruction that defines its measurement capacity\. One consequence is that prompt rewording can change the measurement, sometimes to its opposite\.

Start with the shortcut\. An LLM coder can match human performance without the codebook, by seizing a surface feature, as ordinary language tasks show directly\[[28](https://arxiv.org/html/2606.28574#bib.bib110)\]\. Min and colleagues replaced the codes in a prompt’s worked examples with random ones, and across twelve model configurations performance fell by at most 5%\. The examples were not installing the codebook but cueing associations the LLM already held\[[45](https://arxiv.org/html/2606.28574#bib.bib32)\]\. In another study, McCoy and colleagues trained a set of models, including the language model BERT, to judge whether one sentence entails another, and then built a set of test cases in which the usual heuristic \(predicting entailment when the sentences share most of their words\) was deliberately wrong\. The models followed the heuristic anyway, and performance dropped below 10%, far under the 50% they would reach by random guessing, where expert annotators scored 97%\[[43](https://arxiv.org/html/2606.28574#bib.bib109)\]\. Both these studies are based on simple NLP tasks, so they establish the mechanism rather than measure the harder construct\-coding case\. But the mechanism is general; a coder can match the codes without inferring from the codebook, and the agreement does not reveal how the LLM arrived at the output codes\.
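The shortcut described above can be illustrated with a deliberately crude stand-in. The sketch below is not McCoy and colleagues' code or test set: the function `lexical_overlap_heuristic`, its "every hypothesis word appears in the premise" rule, and the two sentence pairs are all assumptions made for illustration. It only shows how a word-overlap rule can be right on an easy pair and confidently wrong on a pair built to break it.

```python
def lexical_overlap_heuristic(premise: str, hypothesis: str) -> bool:
    """Predict entailment whenever every hypothesis word also occurs in the premise.

    Simplified, hypothetical stand-in for a word-overlap shortcut.
    """
    premise_words = set(premise.lower().rstrip(".").split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return hypothesis_words <= premise_words


# Hypothetical test pairs: (premise, hypothesis, true entailment label).
cases = [
    # Easy pair: overlap and entailment coincide.
    ("The actor and the doctor laughed.", "The doctor laughed.", True),
    # Adversarial pair: same words, reversed roles, so no entailment.
    ("The doctor was paid by the actor.", "The doctor paid the actor.", False),
]

predictions = [lexical_overlap_heuristic(p, h) for p, h, _ in cases]
accuracy = sum(
    pred == gold for pred, (_, _, gold) in zip(predictions, cases)
) / len(cases)
```

On a test set made only of pairs like the second one, a rule of this kind would score near zero, while an evaluation made only of pairs like the first would never expose it.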

The same shortcut reaches the coding of theoretical constructs, where its cost may be a fabricated finding\. Ashwin and colleagues had four LLMs code 2,407 open\-ended interviews with Rohingya refugees and their Bangladeshi hosts against a nineteen\-code scheme built by trained sociologists\[[3](https://arxiv.org/html/2606.28574#bib.bib62)\]\. The best LLM reached a mean $F1 = 0.41$, and its errors were not random\. Instead, the errors aligned with the characteristics of the people interviewed in ten of the nineteen codes\. For low educational ambition, the errors ran 48% more negative for refugees than hosts, and 49% more positive for parents of sons than of daughters, enough to flip the sign of the relationship and report, from the errors alone, a pattern that appears nowhere in the interviews\. A supervised model trained on the same codes showed biased errors in only one of the nineteen and recovered the true relationships\. This rules out task difficulty as the cause\. The instrument was not reading refugee status or a child’s sex, but surface features that covary with them\. The study’s diagnostic value is limited, since Ashwin and colleagues establish the structure of the errors without identifying their mechanism\. They hypothesized that the studied populations are underrepresented in the LLM’s training data\.

Agreement can result from different coders making the same mistake\. Matz and colleagues had ChatGPT and human raters score speed\-date transcripts for romantic attraction, with the outcome known independently \(whether the daters exchanged contact details\)\[[42](https://arxiv.org/html/2606.28574#bib.bib107)\]\. The LLM and human coders somewhat agreed with each other \($r = 0.21$ to $r = 0.35$\), certainly more closely than either matched the actual outcome \(about $r = 0.12$\)\. They all read the same wrong surface feature as a signal, taking negation as rejection, whereas in these transcripts, it weakly predicted attraction instead\. A shared folk theory, not a reading of the attraction, produced the agreement \(in a single study, and for prediction rather than coding\)\. The lesson holds across these cases: for an LLM coding instrument, agreement is reliability, not validity, because it cannot separate a reading of the target construct from a shortcut that mimics it\.

The second problem is the prompt: with all other variables fixed, rewording it can reverse the measurement\. Asked a separate yes\-or\-no question for each moral foundation\[[32](https://arxiv.org/html/2606.28574#bib.bib99),[4](https://arxiv.org/html/2606.28574#bib.bib11)\], GPT\-4 recovered about fifteen of every hundred comments human coders had marked as authority \(recall $R = 0.15$; GPT\-4 Turbo, about five\), so the instrument vastly under\-coded\[[50](https://arxiv.org/html/2606.28574#bib.bib14)\]\. Shown all the foundations at once and asked which applied, the same LLM coded far more comments as authority than those that actually expressed the construct \(precision $P = 0.13$ to $0.19$\), thus reversing to over\-coding\[[15](https://arxiv.org/html/2606.28574#bib.bib70)\]\. On one fixed corpus, a different prompt change produces the same reversal\. Adding worked examples flipped authority coding from 96% more often than humans to 38% less often\[[1](https://arxiv.org/html/2606.28574#bib.bib54)\]\. Authority is codable\. The agreement between the best human annotator and the other coders is $F1 = 0.67$, while the best LLM reaches $F1 = 0.25$\[[15](https://arxiv.org/html/2606.28574#bib.bib70)\]\. An instrument that reverses the direction of error with the wording of the codebook is not measuring the target construct\. The reversal is unlikely to be due to random noise or to task difficulty\. The same prompt variations that flip authority leave care, the foundation LLMs code best, stable across the three studies, with the best human annotator achieving $F1 = 0.87$ for care\[[15](https://arxiv.org/html/2606.28574#bib.bib70)\]\.
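To make the direction of error concrete, the sketch below computes precision, recall, and F1 from raw counts. The counts are invented: they are chosen only so that one run reproduces a recall of 0.15 and the other a precision of 0.13, as reported above. The remaining cells (false positives in the first run, false negatives in the second) are assumptions and do not come from the cited studies.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical under-coding run: 100 comments marked 'authority' by humans,
# the instrument recovers 15 of them and adds 5 spurious ones.
under_coding = precision_recall_f1(tp=15, fp=5, fn=85)

# Hypothetical over-coding run: the instrument codes 100 comments as
# 'authority', only 13 of which humans marked; it misses 7 true cases.
over_coding = precision_recall_f1(tp=13, fp=87, fn=7)
```

Low recall with acceptable precision signals missed cases, whereas low precision with acceptable recall signals a code applied too freely. A single F1 value hides which of the two occurred, which is why the reversal only becomes visible when the two quantities are reported separately.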

Two claims stand, then: the instrument probably arrives at the output codes through a shortcut that agreement cannot trace, and the prompt is so tightly coupled with the measurement that rewording it changes what the instrument measures in the input text\. What separates the coding an LLM gets right from the coding it gets wrong?

### 1\.2 The robot in the cave

To code a construct is to test the input text against the construct’s theoretical demands\. Two texts can share a construct’s prototypical vocabulary and express completely different constructs\. What divides such constructs is inferred rather than determined from surface features alone\. Consider the following thought experiment about a boy and his robot, B9, entering a cave and reading an inscription carved into the wall \(Box [1](https://arxiv.org/html/2606.28574#S1.SS2)\)\.

Box 1\. The robot in the cave

‘Danger\!’ B9 says\. The robot and the boy have found an inscription carved into the wall of the cave: *Three walked the deep path\. The stone closed behind them where water speaks\. Their last breath carries still\. Follow their names to the source\.* B9 concludes that the inscription is a deathly warning\. It has identified the relevant surface features: follow, deep path, stone closed, water speaks, and last breath\. The boy is unconvinced\. He has been taught a distinction from comparative ritual texts\. A warning severs the reader from the dead: turn back, do not follow, do not repeat their fate\. A consecration extends the bond: carry their names, continue their path, keep them present\. The same vocabulary of death serves both\.

The boy tests the lines that seem decisive\. Each reads both ways\. ‘Their last breath carries still’ can mean that the lethal air has not yet cleared, or that their spirit endures among the living\. ‘Follow their names to the source’ can send the reader after the dead to the water that drowned them, or after their souls to the source\. Even *follow*, the word B9 folded into the danger, settles nothing on its own\. What settles the interpretation is the relation, taken across the whole inscription\. Nowhere does the inscription turn the reader back or break with the dead, the act a warning requires\. Instead, it holds the dead present and sends the reader on\. The inscription does not warn\. It consecrates\.

The words in the inscription, the surface features and the associations between them painted one coherent picture\. B9 was not confused about the inscription, and yet it failed\. It treated a classification that requires theoretical distinctions as a task that surface features could settle\. Reading such inscriptions requires task decomposition to separate what surface features alone can determine from what requires theoretical inference\. B9 would first ask whether the passage contains death\-related language, a surface\-cue matching problem\. It would then ask, separately, how the passage positions the reader toward the dead, a theoretically specified interpretive problem\. A recalibrated B9 keeps the two apart, answers the first, and flags the second as unresolved rather than defaulting to a confident code\.
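The decomposition a recalibrated B9 performs can be sketched as two separate questions with separate answers. Everything named below is hypothetical: the cue list `DEATH_CUES`, the `Reading` record, and the labels `"severs"` and `"extends"` are illustrative choices, not part of the paper's procedure. The point is only that the surface question is answered while the relational question is left explicitly open instead of being read off the cues.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical surface cues for death-related language.
DEATH_CUES = ("last breath", "stone closed", "deep path")


@dataclass
class Reading:
    death_language: bool               # settled by surface-cue matching
    stance_toward_dead: Optional[str]  # "severs", "extends", or None if untested


def recalibrated_read(text: str) -> Reading:
    """Answer the surface question; leave the relational question open."""
    lowered = text.lower()
    has_death_language = any(cue in lowered for cue in DEATH_CUES)
    # The stance toward the dead is deliberately NOT inferred from the cues.
    return Reading(death_language=has_death_language, stance_toward_dead=None)


def code(reading: Reading) -> str:
    """Map a reading to a code, refusing to guess when the stance is untested."""
    if not reading.death_language:
        return "not applicable"
    if reading.stance_toward_dead is None:
        return "unresolved"
    return "warning" if reading.stance_toward_dead == "severs" else "consecration"


inscription = (
    "Three walked the deep path. The stone closed behind them where water "
    "speaks. Their last breath carries still. Follow their names to the source."
)
first_pass = recalibrated_read(inscription)
```

Here `code(first_pass)` yields `"unresolved"`. Only a reading whose stance has been tested separately, for example `Reading(True, "extends")`, yields `"consecration"`.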

The principle extends beyond the cave\. For some constructs, surface features constitute reliable evidence, sufficient to settle the code, while for others, a correct code needs inferences the words can only point to as evidence\. The thought experiment leaves one empirical question open\. *How much of the text\-coding landscape falls on each side of the divide?*

### 1\.3 A gradient of theoretical structure

Consider LLM stance coding—classifying a tweet as for, against, or neutral toward a target\. A tweet that reads ‘This policy is reckless and will hurt working families’ contains negative words aimed at the target: ‘reckless,’ ‘hurt\.’ So the against stance reads off the surface\. But sentiment is a correlate of stance, not the construct\. The same stance can appear in either polarity\. Indeed, a more accurate sentiment classifier is no better at stance\[[12](https://arxiv.org/html/2606.28574#bib.bib92)\]\.

Sentiment coded as polarity—positive, negative, neutral—is the simpler case, where evaluative words fix the polarity with no target to misread\. But a more nuanced sentiment construct has a structure that becomes invisible when reduced to polarity\. For instance, compassion and outrage are different, yet both fall at the negative end of sentiment as polarity\. If the instrument is expected to code the specific type of negative sentiment, the coarse\-coding inference breaks down\. These distinctions do not follow from surface features alone\.

Moral\-foundation theory makes the distinction clearer\. The care construct, for instance, requires that \(a\) a sentient being is presented as suffering; \(b\) the author treats that suffering as morally relevant; \(c\) the author takes a stance toward it; and \(d\) the case is distinguished from fairness, loyalty, or authority\. Without these entities and their relationships, care would often be indistinguishable from negative \(harm\) sentiment\. The three cases above trace a gradient of theoretical complexity: how much each construct demands beyond what surface features alone can settle\.

An LLM coding instrument applies the same capacity to any construct across the gradient; what differs is how much inferential work each construct’s coding demands\. Call that capacity*distributional competence*: the inferences a model can draw from the ‘company words keep’ in its training data\[[26](https://arxiv.org/html/2606.28574#bib.bib114)\]\. Its range is wide, from simple keyword matching to sophisticated, multi\-step pattern recognition\[[41](https://arxiv.org/html/2606.28574#bib.bib100)\]\. But its basis is always distributional, and never, on its own, the compositional reasoning a theoretical mechanism requires\[[10](https://arxiv.org/html/2606.28574#bib.bib65)\]\.

### 1\.4 Optimizing the wrong target

Still within the black\-box view of the LLM instrument, a reasonable approach to its failures is to improve the prompt, and the field’s toolkit for eliciting agreement offers several methods\. Each reworks what surrounds the coding decision without reaching the inferences the construct requires\. Fine\-tuning makes the code the training signal; McCoy and colleagues fine\-tuned models on hard cases, and the result was that the LLM followed a narrower heuristic\[[43](https://arxiv.org/html/2606.28574#bib.bib109)\]\. Few\-shot examples cue an association the model already holds\. Scrambling their labels barely changes the output\[[45](https://arxiv.org/html/2606.28574#bib.bib32)\]\. Chain\-of\-thought offers grounds without showing the code was reached by applying them\[[53](https://arxiv.org/html/2606.28574#bib.bib31),[38](https://arxiv.org/html/2606.28574#bib.bib22)\]\. A richer codebook delivers the construct’s definition but does not demonstrate theoretical engagement, and more instruction can even lower consistency\[[33](https://arxiv.org/html/2606.28574#bib.bib43),[34](https://arxiv.org/html/2606.28574#bib.bib116)\]\.

Return to the cave\. Each prompting approach hands B9 more resources, and it may read the added context well enough to draw the theoretical boundary the boy draws through inference\. Suppose it agrees with the boy on every inscription\. The boy can still explain on what grounds each text consecrates rather than warns; for him those grounds are the method\. B9 may even offer grounds of its own, but we have no way of knowing it arrived at the code by applying them\.

So optimization can raise the agreement, and one of these tools may even produce the ideal prompt that tests the construct: the informed decomposition the cave pointed toward\. From agreement alone, though, that prompt is indistinguishable from one that only found a closer correlate; the grounds of the code stay as hidden as before\. The solution cannot be improving agreement alone\.

### 1\.5 Theory\-naive instruments

B9’s failure is an instance of a general condition we call *theory\-naivete*: a coding instrument is theory\-naive when its coding process does not pass through the construct’s *nomological network*, the system of relations a theory specifies between the construct, its components, and what can be observed\[[19](https://arxiv.org/html/2606.28574#bib.bib28)\]\. For the cave inscriptions, the network specifies that a ‘warning’ positions the reader away from the dead, while a ‘consecration’ positions the reader toward the dead\. B9’s inferential path does not pass through these relations\. It passes through surface features and the patterns among them\. A theory\-naive instrument may include a detailed codebook, labelled examples, and chain\-of\-thought reasoning\. The deficit is not in the instrument’s resources but in its inferential path\.

When the instrument’s codes are correct, they cannot be credited to any theoretical principles because the instrument did not test them\. Failures cannot be diagnosed for the same reason\. Earning validity requires evidence that the coding process engages the construct’s components: that each component the theory specifies is tested in its own right, and the results combined into the code by a stated rule\. A construct’s components have specific theoretical roles\. For most constructs, *detection* and *distinction* components are essential\. The former defines the observables and relations that trigger the construct’s expression; the latter defines observables and relations that resolve ambiguities with possible alternative neighbouring constructs\. Different constructs have distinct components that vary with the nature of the underlying theory\. The cave construct has both detection and distinction: a reference to the dead, which any reading must detect, and the stance toward them, which separates a warning from a consecration\. B9 tested the first and read its verdict straight from it, never testing the second\.

### 1\.6 Meaning is not the missing piece

One response to theory\-naivete is that the instrument lacks access to meaning\. Bender and Koller\[[11](https://arxiv.org/html/2606.28574#bib.bib64)\] argue that LLMs trained on linguistic form alone cannot acquire meaning grounded outside the text\. The claim is contested\. Probing and causal\-intervention studies find that LLM representations encode discrete concepts, compositional structure, and abstract semantic roles, more than a strict form\-only reading predicts\[[49](https://arxiv.org/html/2606.28574#bib.bib101)\]\. But this does not reach the gap that matters, between accessing meaning and *testing the relations* a construct specifies\. The coding instruments examined here do not lack access to meaning; they read the codebook, process construct definitions written in natural language, and apply them\.

Consider B9’s assessment of the inscription\. It reconstructs the whole scene, three people drowned in a flooding cave\. B9 still fails because it has no theoretical principle for determining whether the inscription warns the living or consecrates the dead\. Bender and Koller’s divide is between form and meaning; the divide relevant to LLMs as coding instruments is between surface correlates and required construct inferences\. An instrument that accesses meaning can still code from patterns that correlate with the construct without testing the relations that define it\. Richer understanding does not close that gap, which marks the outer limit of distributional competence; we argue that decomposing the construct into independently testable units does\.

### 1\.7 The three opacities

The aforementioned decomposition requires breaking the instrument into three parts: \(a\) the codebook that specifies the construct; \(b\) an inferential engine that applies it; and \(c\) a procedure that governs how the engine applies the codebook\. The engine may be a human coder, a dictionary method, or a language model; which inferences a codebook affords is a property of the codebook and the engine together, not of either alone\[[20](https://arxiv.org/html/2606.28574#bib.bib10)\]\. Each operation addresses a place where the inferential path from input text to code remains hidden from inspection, an *opacity* in the coding process\. We identify three cumulative opacity layers:

1. Definitional opacity\. A coding instrument is definitionally opaque when the construct’s theoretical components are not externalized in the codebook the instrument receives\. The instrument receives only the construct’s name, and the inferential engine supplies whatever content it already associates with that name\. The researcher cannot determine which theoretical components, if any, that content encodes, because the components were never specified as separable targets\. Closing definitional opacity requires decomposing the construct into the components the theory stipulates for a valid coding and supplying each as an explicit, independently assessable instruction\.
2. Inferential opacity\. A coding instrument is inferentially opaque when the components have been specified but the instrument does not report which evidence in the text supports each component’s score\. The codebook specifies what the instrument must check; the instrument returns a code without demonstrating whether it followed the codebook or coded from surface\-feature regularities\. Closing inferential opacity requires that each codebook component be tied to an extractive ground: a specific text span that justifies the score and can be audited independently of the final code\.
3. Compositional opacity\. An LLM coding instrument is compositionally opaque when the rule combining per\-component scores into a final code is not stated\. Closing inferential opacity produces grounded scores for each component\. Those scores must be consolidated into a single construct designation, but if the integration rule is internal to the engine, the researcher cannot determine whether one negative component vetoes the code or whether positive components override it\. Closing compositional opacity requires specifying the integration rule as an explicit function that takes inference scores as inputs to derive the code\.

The cumulative structure imposes a fixed sequence\. Inferential opacity cannot be closed until definitional opacity has been closed, because per\-component evidence requires components to already exist as separable targets\. Compositional opacity cannot be closed until inferential opacity has been closed, because an integration rule requires grounded scores to decide the final output\. In a definitionally opaque instrument, the construct’s name cues content in the LLM that the researcher cannot inspect, leaving the resulting code unexamined\. An instrument that closes all three opacities exposes the construct’s components, the text evidence supporting each, and the rule that assembles them into a code, the theory’s commitments in auditable form\.
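One way to picture an instrument with all three opacities closed is as a small data structure plus one explicit function. The sketch below is an illustration under stated assumptions, not the paper's implementation: the component names, the example sentence, and above all the conjunctive rule in `integrate` (every required component must be present and grounded) are hypothetical. In practice the per-component scores and spans would come from the inferential engine; here they are filled in by hand, and the care construct's distinction component is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentResult:
    name: str                # named component from the codebook (definitional layer)
    present: bool            # per-component score
    evidence: Optional[str]  # verbatim span offered as support (inferential layer)


def is_grounded(result: ComponentResult, text: str) -> bool:
    """Extractive check: the evidence must be an exact substring of the text."""
    return result.evidence is not None and result.evidence in text


def integrate(results: list[ComponentResult], text: str,
              required: tuple[str, ...]) -> bool:
    """Stated rule (compositional layer), assumed conjunctive for illustration:
    code the construct only if every required component is present and grounded."""
    by_name = {r.name: r for r in results}
    return all(
        name in by_name
        and by_name[name].present
        and is_grounded(by_name[name], text)
        for name in required
    )


TEXT = "No child should go hungry; it breaks my heart to see them suffer."
REQUIRED = ("suffering", "moral_relevance", "author_stance")

grounded = [
    ComponentResult("suffering", True, "see them suffer"),
    ComponentResult("moral_relevance", True, "No child should go hungry"),
    ComponentResult("author_stance", True, "it breaks my heart"),
]
# Same scores, but one span does not occur in the text.
ungrounded = grounded[:2] + [ComponentResult("author_stance", True, "this must stop")]

care_coded = integrate(grounded, TEXT, REQUIRED)
care_vetoed = integrate(ungrounded, TEXT, REQUIRED)
```

Because the rule is an ordinary function, a wrong code can be traced to a missing component, an unsupported span, or the rule itself. The dependency order described above is also visible: `integrate` cannot run without grounded results, and grounding cannot be checked without named components.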

Even the approaches that close the most opacities provide no evidence that the coding process engages the construct’s components rather than distributional correlates\. We next weigh the field’s instruments against this requirement\.

## 2 Construct validity

The problem above places a single demand on a valid coding instrument: its codes must arise from the process the construct specifies, not from surface correlates\. This demand predates LLMs\. Messick called it the *substantive aspect* of construct validity\[[44](https://arxiv.org/html/2606.28574#bib.bib29)\]\. Set out in 1995 as one of six aspects any construct measure must satisfy, it is a standard this field endorses whenever it invokes construct validity\.¹ A theory\-naive instrument has not earned Messick’s substantive aspect\.

¹ A recent account would ground construct validity for LLMs in Cronbach and Meehl’s nomological network rather than in Messick\[[27](https://arxiv.org/html/2606.28574#bib.bib118)\]\. The two are complementary, not rival\. The nomological network fixes the construct’s inferential machinery, while the substantive aspect requires that an instrument’s process engages it; Kane, working within the inferential account, grants that construct inferences presuppose a nomological network in the first place\[[36](https://arxiv.org/html/2606.28574#bib.bib58)\]\.

That standard is no longer overlooked\. A growing body of research treats construct validity as a central question for LLMs\[[46](https://arxiv.org/html/2606.28574#bib.bib113),[27](https://arxiv.org/html/2606.28574#bib.bib118),[38](https://arxiv.org/html/2606.28574#bib.bib22),[39](https://arxiv.org/html/2606.28574#bib.bib23),[8](https://arxiv.org/html/2606.28574#bib.bib21),[6](https://arxiv.org/html/2606.28574#bib.bib119)\]\. Most of this work, however, concerns a different object—whether a benchmark measures a capability attributed to LLMs, not whether they validly code a construct\. What remains is to operationalize the standard for a coding instrument\. To our knowledge, this has not yet been done\.

The work reviewed below improves the coding instrument\. Sharper definitions, better diagnostics, and iterative refinement each strengthen it\. Yet all measure success the same way, by agreement or by properties of the output, never by whether the coding process engages the construct’s components\. Each is read here against that one standard, the substantive aspect\. Indeed, a now extensive literature refines how construct definitions and annotation guidelines reach the coding instrument\[[7](https://arxiv.org/html/2606.28574#bib.bib18),[37](https://arxiv.org/html/2606.28574#bib.bib44),[47](https://arxiv.org/html/2606.28574#bib.bib46),[51](https://arxiv.org/html/2606.28574#bib.bib48),[56](https://arxiv.org/html/2606.28574#bib.bib49),[54](https://arxiv.org/html/2606.28574#bib.bib38),[57](https://arxiv.org/html/2606.28574#bib.bib50),[22](https://arxiv.org/html/2606.28574#bib.bib42)\]\. These interventions close definitional opacity, giving the model the construct’s components rather than the code alone\.

Methodological consensus treats concordance with human annotators as the terminal validation step\. Primers, tutorials, and review articles across computational social science prescribe the same sequence: compare LLM codes with human codes, report F1 and Cohen’s κ, and declare the instrument validated\[[48](https://arxiv.org/html/2606.28574#bib.bib35),[21](https://arxiv.org/html/2606.28574#bib.bib103),[14](https://arxiv.org/html/2606.28574#bib.bib104),[31](https://arxiv.org/html/2606.28574#bib.bib76)\]\. The most comprehensive codification, Abdurahman et al\.’s primer for evaluating LLMs in social\-science research\[[2](https://arxiv.org/html/2606.28574#bib.bib108)\], prescribes six steps: validate against human ground truth, check for demographic bias in misclassification, test prompt robustness, document parameters, handle errors, and repeat coding runs\. Every step operates on agreement or its error structure, never on whether that agreement was reached by engaging the construct’s components\.
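For readers less familiar with the agreement statistic in that sequence, the sketch below computes Cohen's κ from two lists of codes using the standard definition: observed agreement minus chance agreement, divided by one minus chance agreement. The two label lists are invented for illustration. Note what the function takes as input: only the output codes of the two coders, which is why it cannot say how either coder reached them.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders who each assign one label per item."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for eight comments.
human = ["care", "care", "none", "none", "care", "none", "care", "none"]
llm = ["care", "care", "none", "care", "care", "none", "none", "none"]

kappa = cohens_kappa(human, llm)
```

With these invented labels the coders agree on six of eight items and chance agreement is 0.5, so κ comes out at 0.5. The same value would result whether the second coder applied the construct's components or a correlated surface cue.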

The primer is not naive about the risks: it checks whether misclassifications covary with speaker demographics, cites cases where models responded to disability vocabulary rather than toxicity, and acknowledges that hybrid approaches may outperform zero-shot prompting. These are the ingredients of a critique that asks what the instrument measures, not merely whether it agrees. Yet they are never assembled into that critique. The primer’s worked example validates a hypothetical moral-foundation coding study by reporting accuracy, F1, and $\kappa$. It presents balanced per-class performance as the expected outcome, for a construct on which the same research group documented F1 as low as 0.03 for individual foundations \[[1](https://arxiv.org/html/2606.28574#bib.bib54)\].

### 2\.1Still the wrong reasons

One framework reaches further. Birkenmaier et al.’s ValiText \[[13](https://arxiv.org/html/2606.28574#bib.bib57)\] adapts Loevinger’s tripartite validity model to computational text analysis, distinguishing substantive evidence (theoretical underpinning), structural evidence (model and output properties), and external evidence (concordance with human annotations). The framework identifies concordance alone as insufficient and warns that a model with strong structural bias may produce misleading associations with external criteria. Substantive evidence is deemed mandatory, and diagnosed as the weakest class of validity evidence in the LLM-coding literature (footnote 2).

Footnote 2: The call to hold AI evaluation to this standard is not ours alone. Reviewing cognitive AI benchmarks, Mitchell argues that benchmark accuracy seldom establishes the capability a test purports to measure, and frames the deficit explicitly as a lack of *construct validity*, an independent demand for the same standard from outside measurement theory \[[46](https://arxiv.org/html/2606.28574#bib.bib113)\].

While the diagnosis is precise, a solution remains absent\. ValiText’s seven substantive\-evidence steps require the researcher to document a literature review, justify the operationalization, produce a codebook, report interrater agreement, and justify data collection, method, preprocessing, and level of analysis\. All seven are pre\-measurement justifications\. They ask the researcher to argue that the measurement should be valid, not to demonstrate empirically that it is\.

The framework’s closest approach to empirical process evidence is its feature\-inspection step, which asks whether the model’s top\-weighted features are conceptually aligned with the construct\. But feature inspection operates at the level of individual features, not at the level of the construct’s compositional structure\. A moral\-foundation classifier can weight features such as the presence of ‘harm’ and ‘suffering’ highly, conceptually aligned features, while failing to distinguish care from fairness\. Feature alignment does not reveal compositional failure\. ValiText identifies the absence of substantive evidence in current practice\. It does not provide a procedure for producing it\.

In econometrics, the same gap appears from the other end\. Cristian Espinal Maya does provide a procedure, setting conditions under which an LLM score can serve as a measure of a latent variable and adding a statistical correction for the noise in that score\[[25](https://arxiv.org/html/2606.28574#bib.bib120)\]\. But the conditions only certify how the score behaves as a variable, such as that it never references the outcome it will later predict, and the correction only removes the bias that the score’s noise introduces into later estimates\. Neither framework asks whether the instrument engages the construct’s components\.

Three other approaches to the validity problem come closest to producing the evidence Birkenmaier’s taxonomy demands: (a) reasoning externalization, (b) iterative refinement, and (c) error diagnosis. Hou et al. \[[35](https://arxiv.org/html/2606.28574#bib.bib39)\] decompose coding tasks into step-based chain-of-thought prompts, each targeting a single dimension of student annotations on a social-learning platform. The published prompts reveal what the decomposition operationalizes. The *Theorizing prompt* checks for opinion markers (‘think’, ‘believe’, ‘remember’), while the *Integration prompt* checks for response markers (‘This is true’, ‘@’). The decomposition follows the codebook’s level definitions rather than the construct’s theoretical components. Whether a student constructs an original argument, Theorizing’s definitional criterion, requires comparing the annotation to the source text; the prompt checks for vocabulary instead. A revealing diagnostic confirms this, since including the source text as additional input *decreased* accuracy. If the model were engaging the construct’s components, context should help; that it hurts indicates a surface-feature strategy disrupted by additional text, not a construct-engagement strategy aided by relevant context.

The second approach is epistemically different. Chausson et al.’s Insight-Inference Loop \[[17](https://arxiv.org/html/2606.28574#bib.bib7)\] does not treat concordance as validity. The paper positions its output as researcher-calibrated labels and invokes Bourdieu, Chamboredon, and Passeron’s principle that the objects of sociological investigation are constructed through controlled intervention, not given by data. That intervention *is* the loop, in which the researcher defines claims as single-clause declarative statements, scores each with a natural-language-inference (NLI) model, calibrates decision thresholds against their own annotations, revises underperforming claims, and repeats. Its limitation is architectural, not epistemological. The loop calibrates the instrument against the researcher’s holistic judgement of whether a text contains a claim. But that reference standard shares the instrument’s vulnerability. When the NLI model responds to a component’s surface features rather than the component itself, the researcher, seeing the same vocabulary, reaches the same conclusion. An error both make cannot be calibrated away.

Two structural features compound this\. The entailment score compresses the whole document–claim relationship into a single number that does not decompose into per\-component evidence; the revision step pushes claims toward the corpus’s language, advising the researcher to ‘emulate the language used’ and to replace abstract claims with more literal ones, not toward what the theory says the construct is\. Suppose a researcher addressed every limitation, deriving claims from the construct’s theoretical specification, testing each independently against the component it targets, combining the results through a stated rule from theory, and requiring extractive grounding for each judgement\. The researcher would have replaced every structural element of the coding\-reliability architecture\. What remains is the NLI engine\. The engine is not the contribution; the validation architecture is\.

Xu et al\.\[[55](https://arxiv.org/html/2606.28574#bib.bib121)\]come nearest the substantive aspect, and from inside the agreement paradigm itself\. They reject the single agreement score and sort annotation errors by where they come from and how far they miss\. From how often humans and the model err together, they then estimate how much of the error the task itself makes unavoidable\. The decomposition characterizes the errors systematically: where they fall, how many, and of what kind\. What it does not test is the route, whether the model reached its label by engaging the construct’s components or by reading a correlate of them\. And its estimate of task\-inherent error holds only where the human coders themselves read the construct\. Where the coders and the model share a surface cue, the overlap is shared invalidity, which agreement cannot separate from genuine task difficulty\. Drawing that line, separating a correlate the coders happen to share from a construct that is simply hard, is what the substantive aspect requires, and where this diagnostic stops\.

What none of these approaches supplies is the procedure that would produce that evidence—decompose the construct into components, test each against the text on its own, and combine the results by a stated rule\.

The three opacities reveal the most general types of error\. For example, a coding instrument that relies solely on surface features exhibits compositional opacity, with no stated rule for combining component evidence\. ‘Harm’ and ‘suffering’ can lead to coding the care moral foundation, regardless of whether a sentient being is presented as harmed, the harm is morally appraised, or the case is distinguished from fairness or loyalty\[[50](https://arxiv.org/html/2606.28574#bib.bib14),[15](https://arxiv.org/html/2606.28574#bib.bib70),[1](https://arxiv.org/html/2606.28574#bib.bib54)\]\. The demographically structured errors documented in qualitative interview coding\[[3](https://arxiv.org/html/2606.28574#bib.bib62)\]are a consequence of inferential opacity, in which the LLM’s generative process leaves no record of which component drove the code, leaving the researcher unable to determine whether errors follow what was said or who said it\.

The gap between distributional competence and theoretical structure is not closed by refining what the instrument receives—richer definitions, more examples, externalized reasoning steps, iterative threshold calibration—because the problem is in the instrument’s architecture\. While the opacities locate where theory\-naivete operates, they do not prescribe a validation procedure\. We outline a procedure in the next section\.

## 3Grain calibration

The three opacities specify what the architecture of the LLM coding instrument must make accountable to validation: (a) the components a construct’s theory specifies; (b) the inference that tests each one against the text; and (c) the rule that combines those tests into a code. These three architectural elements are derived from the construct’s *nomological network*, the relations a theory draws among the construct, its components, and what the text can show. Calibration makes the instrument traverse the nomological network rather than bypass it, as a theory-naive instrument does. The decomposition follows the construct-to-component relations; individual clauses test a component-to-text relation; the integration rule encodes how the components compose the construct. Indeed, the failure literature reviewed above shows that decomposition works: even a post-hoc, bottom-up decomposition of a coding task roughly doubles the variance a model’s predictions explain \[[42](https://arxiv.org/html/2606.28574#bib.bib107)\].

Yet, like agreement with human codes, auditability is not validity either. An audited path can still be the wrong one. The components may not be the construct’s, the tests may answer an easier question, or the construct’s structure may have been replaced by a surface correlate. Validity is not earned outright but argued from evidence. *Calibrating* the now-visible structure to the construct (grain) earns the substantive aspect, the missing piece of that argument.

Grain calibration sets the *grain* of the decomposition (how finely the construct is split) so that each piece is a question the LLM can answer reliably, then revises it until the evidence shows the theorized components are the ones doing the work. This paper outlines the procedure and makes the case that it earns the substantive aspect. A forthcoming paper formalizes the method and demonstrates it end-to-end on moral-foundations coding. Suffice it to say, the procedure is iterative, with a human in the loop and a stop condition. It starts from the earlier decomposition, the target construct broken into components, each a condition the theory says an instance must meet.

The method refines one step further, splitting each theoretical component into *clauses*: binary questions about what the text must show, each tied to the span that answers it. The split does not run to arbitrary depth; it stops where every clause falls within distributional competence, where the question is one the LLM can answer from surface features without the theoretical relation the construct depends on. Decompose too coarsely, and a clause still hides an inference the engine cannot make, a misfit visible in the instrument’s behaviour, not a judgement the analyst makes in advance. Decompose too finely, and the clauses fragment past what the theory demands, into questions about, e.g., language rather than the construct.

Setting the grain is half of the procedure; the other half earns the warrant, the substantive aspect in Messick’s terms\[[44](https://arxiv.org/html/2606.28574#bib.bib29)\]\. Each clause is run on the input text, and its answer is recorded with the verbatim span that justifies it\. A stated rule, explicit and inspectable, rather than lodged in the LLM’s single generative pass, combines the clause answers into an output code\. The rule is a model fit to human codes as ground truth with per\-clause coefficients \(weights\)\. Note that this is not agreement\-maximization\. The inputs are theory\-grounded clauses, not surface features, so the weights are read against the construct’s nomological network rather than pushed to raise a score\. Through grain calibration, the instrument earns Messick’s substantive aspect, not agreement with a criterion, but evidence that the processes the theory posits are the processes the instrument runs\.
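One way to write such a rule down is sketched here, assuming the regularized logistic form the paper names later for present-or-absent constructs. The symbols $c_k$, $\beta_k$, $p_i$, and $\lambda$ are notation introduced for illustration and do not appear in the source.

$$
\Pr(y = 1 \mid c_1, \dots, c_K) = \sigma\Big(\beta_0 + \sum_{k=1}^{K} \beta_k c_k\Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\hat{\beta} = \arg\min_{\beta} \; -\sum_{i=1}^{N} \big[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\big] + \lambda \sum_{k=1}^{K} \beta_k^{2}
$$

Here $c_k \in \{0, 1\}$ is the recorded answer to clause $k$, $y_i$ is the human code for text $i$, $p_i$ is the fitted probability for that text, and $\lambda$ sets the strength of the penalty. Under this reading, each $\beta_k$ is the per-clause weight that the researcher compares with what the theory requires.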

### 3\.1Clauses with grounds

Not every construct needs grain calibration\. Sentiment as polarity is distributional\. The discriminating signals, valence and arousal, already ‘live’ in the LLM, so an instrument reading only the words codes it directly\. Decomposing it would impose structure the construct lacks\. The MFT care foundation is the contrasting case\. The MFT specifies three components, and the vocabulary of harm satisfies none of them on its own\. A sentient being must be presented as suffering; that suffering must be treated as morally relevant, not an incidental detail; and the text must take an evaluative stance toward it, not a bare report\.

Each component becomes a clause: a question put to the text, answered yes or no, with the span that warrants the answer. Does the text reference suffering or vulnerability? This clause is *detection*: it specifies what the instrument must catch, and defends against the construct going uncoded where it is present. Is the welfare of a sentient being at stake, rather than mere negative feeling? This clause is *distinction*: it specifies what the instrument must refuse, and defends against coding a neighbour (loss, disgust, conflict) as care/harm. Does the writer take an evaluative stance toward the suffering? This clause is *appraisal*, the third component, the one an appraisive construct requires; without it, a clinical report of injury would get coded as care. Detection and distinction guard the two ways an instrument can fail a construct, missing what it covers and admitting what it excludes, the under-representation and irrelevant variance a valid instrument must defend against \[[44](https://arxiv.org/html/2606.28574#bib.bib29)\].
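The three clauses can be held in a small record type, sketched below under stated assumptions. The class name `ClauseAnswer`, the helper `is_grounded`, the example sentence, and the spans are all hypothetical; only the clause questions and role names come from the text. The check enforces one reading of extractive grounding: a "yes" counts only if its span occurs verbatim in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClauseAnswer:
    role: str            # "detection", "distinction", or "appraisal"
    question: str        # the binary question put to the text
    answer: bool         # yes / no
    span: Optional[str]  # verbatim span that warrants the answer


def is_grounded(text: str, ans: ClauseAnswer) -> bool:
    """True when the answer is backed by a span found verbatim in the text."""
    if not ans.answer:
        # Assumption of this sketch: a "no" has nothing to quote, so no span
        # is required for it.
        return True
    return bool(ans.span) and ans.span in text


text = ("The flood left hundreds of families without shelter, "
        "and it is shameful that no aid was sent.")

answers = [
    ClauseAnswer("detection",
                 "Does the text reference suffering or vulnerability?",
                 True, "families without shelter"),
    ClauseAnswer("distinction",
                 "Is the welfare of a sentient being at stake, "
                 "rather than mere negative feeling?",
                 True, "hundreds of families"),
    ClauseAnswer("appraisal",
                 "Does the writer take an evaluative stance toward the suffering?",
                 True, "it is shameful that no aid was sent"),
]

# A "yes" whose span is paraphrased rather than quoted fails the check.
ungrounded = ClauseAnswer("appraisal", answers[2].question, True,
                          "the victims deserved better")

for a in answers + [ungrounded]:
    print(a.role, a.answer, is_grounded(text, a))
```

Keeping the span on the record means each clause answer can be audited against the text without rerunning the model.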

An explicit *integration rule* combines the three answers into the output code. This combination is what calibration acts on, and it is what the single-question prompt cannot make explicit, because the prompt’s rule is buried in the generative pass that produces the output. The decomposition is not a rival classifier set against the prompt. It is a diagnostic that makes the prompt’s commitments visible, clause by clause, where each can be tested against the construct.

Grain calibration applies to target constructs whose underlying theory defines components through a nomological network, and whose possible output codes are defined a priori. Such constructs are *codable*. A construct generated from the data rather than specified in advance, such as one built through grounded theory \[[16](https://arxiv.org/html/2606.28574#bib.bib26)\], has no prior network to decompose, and is uncodable and thus unsuitable for grain calibration. This is a necessary boundary rather than a limitation. Grain calibration provides substantive-aspect evidence regarding the construct’s theory, not evidence that the theory itself is correct. A mistaken theory, therefore, yields an instrument that validly operationalizes the wrong construct.

### 3\.2Human in the loop

Return to the cave\. B9’s verdict was itself a rule, unstated: death\-vocabulary present, therefore warning\. Written as weights, that rule puts everything on the reference clause and nothing on the stance, the clause that separates a warning from a consecration\. A rule that weighed the stance would have returned consecration; B9 never tested it\.

The weight pattern is the diagnosis, and it holds regardless of how often B9 is right. An instrument can place all its weight on a vocabulary clause, agree with the boy on most inscriptions, and still never test the stance. This is the difference between two questions one can put to any coding instrument: whether it agrees with the human codes, and why it reached the code it did. Agreement answers the first and stops. The weights answer the second, showing which clause settled the code, and that answer is evidence about the process, the substantive aspect (footnote 3).

Footnote 3: The same move, reading the process rather than scoring the output, appears in abstract reasoning: asked to state the rule behind a correct answer, models reveal that many correct outputs rest on unintended rules, right answers reached by the wrong route \[[9](https://arxiv.org/html/2606.28574#bib.bib112)\].

This is where the human enters the loop\. The researcher reads the fitted weights against what the theory requires\. A weight that departs from it—an essential clause left near zero, or one clause settling the whole code—points to the clause to revise\. The researcher rewrites the clause or the decomposition, re\-fits the rule, and reads the weights again\. Calibration is this loop, and the weights are what the analyst turns, iteration by iteration, until every weight is one the theory can warrant\.
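The reading step in this loop can be partly mechanized as a screening aid, sketched below. The function name `flag_weights` and both thresholds (`near_zero`, `dominance`) are assumptions made for illustration; the source does not specify numeric cut-offs, and the judgement about what the theory warrants stays with the researcher.

```python
def flag_weights(weights, essential, near_zero=0.1, dominance=0.8):
    """Flag fitted clause weights that depart from what the theory requires.

    weights   : dict mapping clause name -> fitted coefficient
    essential : set of clause names the theory treats as indispensable
    near_zero : hypothetical cut-off below which a weight counts as ignored
    dominance : hypothetical share of total absolute weight that counts as
                one clause settling the whole code
    """
    total = sum(abs(w) for w in weights.values())
    flags = []
    for clause, w in weights.items():
        if clause in essential and abs(w) < near_zero:
            flags.append((clause, "essential clause near zero"))
        if total > 0 and abs(w) / total >= dominance:
            flags.append((clause, "one clause settles the code"))
    return flags


# A B9-like rule: everything on the reference clause, nothing on the stance.
b9_like = {"reference": 3.0, "stance": 0.0}
print(flag_weights(b9_like, essential={"reference", "stance"}))

# A rule that weighs both clauses raises no flag under these thresholds.
balanced = {"reference": 1.5, "stance": 1.4}
print(flag_weights(balanced, essential={"reference", "stance"}))
```

A flag names the clause to revisit; whether to rewrite the clause or redraw the decomposition is the researcher's call at each iteration.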

The form of the rule follows the construct rather than convenience\. A regularized logistic regression serves a present\-or\-absent construct because its coefficients are readable, its decisions inspectable, and its weights revisable when the theory says a clause should matter more\. A magnitude construct takes a linear rule, and an interaction or threshold enters only when the construct’s definition specifies one\. A term added to fit the human codes better, without that warrant, is construct\-irrelevant variance at the level of composition, the very thing the rule exists to catch\.
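For a present-or-absent construct, fitting such a rule can be sketched in plain Python as an L2-regularized logistic regression trained by gradient descent. Everything specific here is an assumption for illustration: the function names, the hyperparameters (`l2`, `lr`, `epochs`), and the toy data, in which the human code is 1 only when all three clauses are answered yes. A real study would more likely use an established statistics library.

```python
import math
from itertools import product

CLAUSES = ["detection", "distinction", "appraisal"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, answers):
    """Probability that the construct is present, given 0/1 clause answers."""
    return sigmoid(bias + sum(w * a for w, a in zip(weights, answers)))


def fit_rule(X, y, l2=0.01, lr=0.5, epochs=3000):
    """Fit per-clause weights by gradient descent on the penalized log-loss."""
    n, k = len(X), len(X[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for answers, code in zip(X, y):
            err = predict(weights, bias, answers) - code
            grad_b += err
            for j in range(k):
                grad_w[j] += err * answers[j]
        bias -= lr * grad_b / n
        # The L2 penalty shrinks clause weights only; the bias is unpenalized.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
    return weights, bias


# Toy data: every yes/no pattern once, plus extra all-yes texts so that the
# positive code is not vanishingly rare. Human code = 1 iff all clauses hold.
X = list(product([0, 1], repeat=3)) + [(1, 1, 1)] * 3
y = [int(all(row)) for row in X]

weights, bias = fit_rule(X, y)
for name, w in zip(CLAUSES, weights):
    print(f"{name:12s} {w:+.2f}")
print(f"P(code | all yes)       = {predict(weights, bias, (1, 1, 1)):.2f}")
print(f"P(code | appraisal = no) = {predict(weights, bias, (1, 1, 0)):.2f}")
```

The coefficients stay readable because each input is one theory-grounded clause; no interaction term is added, in keeping with the text's requirement that such terms enter only when the construct's definition specifies them.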

Because the clauses are interpretable, so are the fitted rule’s errors\. Its confusion matrix, read with the weights, places each misclassification on a clause: an under\-weighted detection clause that lets the construct slip, or an under\-weighted distinction clause that lets a neighbour in\. The matrix tells the researcher which clause to revise, turning each error from a verdict into an instruction\. Three further readings feed the same revision\. The residual, the gap in agreement between the opaque prompt and the explicit decomposition, measures how much of the prompt’s behaviour the components do not yet account for; a large residual signals a correlate the theory omits\. The weight profile, one clause settling the construct while another contributes almost nothing, is a claim about where the construct’s weight concentrates, open to dispute\. Collinearity, clauses the theory treats as separate moving together across texts, marks a decomposition cut at the wrong joint\.
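Two of these readings lend themselves to a short sketch. The residual is computed here as the share of texts on which the opaque prompt and the fitted rule disagree, which is one possible operationalization and not a definition given in the source. Collinearity is screened with the phi coefficient (Pearson correlation on 0/1 answers). The function names, the 0.9 threshold, and the toy data are assumptions.

```python
import math


def residual(prompt_codes, rule_codes):
    """Share of texts where the single-pass prompt and the explicit rule differ."""
    n = len(prompt_codes)
    return sum(p != r for p, r in zip(prompt_codes, rule_codes)) / n


def phi(a, b):
    """Phi coefficient between two lists of 0/1 clause answers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        # A clause with a constant answer has no defined correlation;
        # reporting 0.0 is a choice made for this sketch.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def collinear_pairs(answers, threshold=0.9):
    """Clause pairs whose answers move together across texts."""
    names = list(answers)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = phi(answers[names[i]], answers[names[j]])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs


# Hypothetical clause answers over eight texts.
answers = {
    "detection":   [1, 1, 1, 0, 1, 0, 1, 0],
    "distinction": [1, 1, 1, 0, 1, 0, 1, 0],  # never separates from detection
    "appraisal":   [1, 0, 1, 0, 0, 0, 1, 1],
}
prompt_codes = [1, 1, 0, 0, 1, 0, 1, 1]
rule_codes = [1, 0, 0, 0, 1, 0, 1, 0]

print("residual:", residual(prompt_codes, rule_codes))
print("collinear:", collinear_pairs(answers))
```

In this toy case the detection and distinction clauses are flagged as moving together, the pattern the text associates with a decomposition cut at the wrong joint.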

This raises an objection: has the validity judgement simply moved from the model to the analyst who draws the decomposition? It has not\. A decomposition is not asserted but tested by the three readings above—cut the construct at the wrong joints and the residual swells, a theory\-essential clause is left at zero weight, or separate clauses move together\. The decomposition that survives is not the analyst’s preference but a claim the evidence was given every chance to break\.

One inference runs the other way, and it is a hypothesis, not a result\. When the disagreement with the human codes is itself structured, concentrated on the clauses the theory makes decisive, the fitted rule can be read as a test of the human coding rather than of the instrument: evidence that the codes rest on a surface correlate the components exclude\. This holds only where the components are independently credited to the construct\. Absent that credit, a structured disagreement indicts the instrument as readily as the coders, so the claim must be argued from the grounding of the components, never from the fit\.

Table 1: An LLM coding instrument before and after grain calibration. Each row is a property of the coding process, contrasting a theory-naive instrument with the same instrument after calibration. The codebook is delivered in both columns; what calibration changes is whether the process is decomposed into clauses, combined by a stated rule, and fed back against theory. The last row is the decisive one, reliability alone against evidence for the substantive aspect of validity.

Grain calibration is two things at once. It is a method for operationalizing a theoretical construct (decompose, write the clauses, state the rule) and a procedure for earning its validity, by showing through the weights and the residual that the theorized components are the ones the instrument runs on. The two are not separable steps. The same decomposition that builds the instrument is what makes its validity inspectable; construction and warrant are one act seen from two sides.

This is the whole of the contribution, and it is deliberately modest in its raw materials. There is no new standard of validity here, no new theory of meaning, no claim that language models do or do not understand. There is a standard the field has held for three decades, and an architecture that, to our knowledge, is the first to produce the evidence that standard demands from an LLM coding instrument (Table [1](https://arxiv.org/html/2606.28574#S3.T1)). The provocation of this paper was never that agreement is worthless; it does map to reliability. The provocation is that the field has been accepting reliability as validity because the procedure that tells them apart had not been specified. Grain calibration makes the distinction operational: an instrument can now be asked not only whether it agrees, but whether it agrees for the reasons the construct specifies.

The grain calibration procedure also defines the boundary of the class of coding problems it does not solve. Statistical methods that correct a noisy instrument’s estimates operate downstream of the code and cannot reveal that the instrument measures the wrong construct; grain calibration operates upstream, on the decomposition that fixes what is measured at all \[[24](https://arxiv.org/html/2606.28574#bib.bib8)\]. The remaining problems are concrete rather than foundational, and Box [3.2](https://arxiv.org/html/2606.28574#S3.SS2) outlines two of them. Each extends the same architecture, from one construct to many, and from a single model to a comparison across models.

The title asked whether an instrument can produce correct codes for the wrong reasons\. It can, but agreement alone will never reveal it\. Grain calibration is what turns that limitation into a validation procedure\. Decompose the construct, calibrate it to what the engine can read, and require the instrument to show that its code rests on the components the theory specifies, not on the words that merely keep them company\.

Box 2. Research agenda

*Cross-construct validation.* Grain calibration predicts that constructs with more theoretical structure leave larger residuals under single-pass coding. Testing this across constructs, from near-distributional ones to densely appraisive ones, would turn the residual into a comparative measure of how much theory a construct demands of its instrument.

*Multi-model comparisons.* Because the weight profile is read in the theory’s own terms, the same decomposition run on different models yields comparable component-process profiles. These profiles offer a construct-level way to compare what models can and cannot reliably code, beyond aggregate agreement.

## Acknowledgements

The author thanks Daniel Cardoso and Mauricio Martins for helpful discussions\.

## Competing interest

The author declares no competing interests\.

## Funding

This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No\. 101094988 \(CRESCINE\)\. Views and opinions expressed are those of the author only and do not necessarily reflect those of the European Union or the granting authority; neither can be held responsible for them\.

## Author contributions

MP conceived the framework, developed the theoretical argument, and wrote the manuscript\.

## Data availability

This article contains no original data\. All empirical findings cited are from published sources\.

## References

- \[1\]S\. Abdurahman, M\. Atari, F\. Karimi\-Malekabadi, M\. J\. Xue, J\. Trager, P\. S\. Park, P\. Golazizian, A\. Omrani, and M\. Dehghani\(2024\)Perils and opportunities in using large language models in psychological research\.PNAS Nexus3\(7\),pp\. pgae245\.Note:S\.A\. and M\.A\. contributed equallyExternal Links:[Document](https://dx.doi.org/10.1093/pnasnexus/pgae245)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p5.8),[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p10.1),[§2](https://arxiv.org/html/2606.28574#S2.p5.1)\.
- \[2\]S\. Abdurahman, A\. Salkhordeh Ziabari, A\. K\. Moore, D\. M\. Bartels, and M\. Dehghani\(2025\)A primer for evaluating large language models in social\-science research\.Advances in Methods and Practices in Psychological Science\.External Links:[Document](https://dx.doi.org/10.1177/25152459251325174)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p4.1)\.
- \[3\]J\. Ashwin, A\. Chhabra, and V\. Rao\(2025\)Using large language models for qualitative analysis can introduce serious bias\.Sociological Methods & Research\.External Links:[Document](https://dx.doi.org/10.1177/00491241251338246),ISSN 0049\-1241Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p3.3),[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p10.1)\.
- \[4\]M\. Atari, J\. Haidt, J\. Graham, S\. Koleva, S\. T\. Stevens, and M\. Dehghani\(2023\)Morality beyond the WEIRD: how the nomological network of morality varies across cultures\.Journal of Personality and Social Psychology125\(5\),pp\. 1157–1188\.Note:Introduces the MFQ\-2, splitting Fairness/Reciprocity into Equality and ProportionalityExternal Links:[Document](https://dx.doi.org/10.1037/pspp0000470)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p5.8)\.
- \[5\]S\. B\. Bacharach\(1989\)Organizational theories: Some Criteria for Evaluation\.Academy of Management Review14\(4\),pp\. 496–515\.External Links:[Document](https://dx.doi.org/10.5465/amr.1989.4308374)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[6\]C\. Barrie, L\. P\. Argyle, J\. Bisbee, M\. Heseltine, C\. Lucas, J\. Mellon, A\. Palmer, M\. Roberts, and A\. Spirling\(2026\)AI and research methods\.APSA Preprints \(Cambridge Open Engage\)\.External Links:[Document](https://dx.doi.org/10.33774/apsa-2026-h59kk),[Link](https://doi.org/10.33774/apsa-2026-h59kk)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p2.1)\.
- \[7\]C\. Barrie, P\. Palaiologou, and P\. Törnberg\(2024\)Prompt stability scoring for text annotation with large language models\.arXiv preprint arXiv:2407\.02039\.Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[8\]A\. M\. Bean, R\. O\. Kearns, A\. Romanou,et al\.\(2025\)Measuring what matters: construct validity in large language model benchmarks\.InAdvances in Neural Information Processing Systems 38 \(NeurIPS 2025\), Datasets and Benchmarks Track,External Links:2511\.04703Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p2.1)\.
- \[9\]C\. Beger, R\. Yi, S\. Fu, K\. Denton, A\. Moskvichev, S\. Tsai, S\. Rajamanickam, and M\. Mitchell\(2025\)Do AI models perform human\-like abstract reasoning across modalities?\.arXiv preprint arXiv:2510\.02125\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2510.02125)Cited by:[footnote 3](https://arxiv.org/html/2606.28574#footnote3)\.
- \[10\]E\. M\. Bender, T\. Gebru, A\. McMillan\-Major, and S\. Shmitchell\(2021\)On the dangers of stochastic parrots: can language models be too big?\.InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency \(FAccT\),pp\. 610–623\.External Links:[Document](https://dx.doi.org/10.1145/3442188.3445922)Cited by:[§1\.3](https://arxiv.org/html/2606.28574#S1.SS3.p4.1)\.
- \[11\]E\. M\. Bender and A\. Koller\(2020\)Climbing towards NLU: on meaning, form, and understanding in the age of data\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 5185–5198\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.463)Cited by:[§1\.6](https://arxiv.org/html/2606.28574#S1.SS6.p1.1),[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[12\]S\. E\. Bestvater and B\. L\. Monroe\(2023\)Sentiment is not stance: target\-aware opinion classification for political text analysis\.Political Analysis31\(2\),pp\. 235–256\.Note:Sentiment classifiers \(including BERT\) perform near\-randomly at stance detection; F1 drops \>30 ptsExternal Links:[Document](https://dx.doi.org/10.1017/pan.2022.10)Cited by:[§1\.3](https://arxiv.org/html/2606.28574#S1.SS3.p1.1)\.
- \[13\]L\. Birkenmaier, C\. Wagner, and C\. Lechner\(2023\)ValiText: a unified validation framework for computational text\-based measures of social constructs\.arXiv preprint\.External Links:2307\.02863,[Link](https://arxiv.org/abs/2307.02863)Cited by:[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p1.1)\.
- \[14\]J\. Brickman, M\. Gupta, and J\. R\. Oltmanns\(2025\)Large language models for psychological assessment: a comprehensive overview\.Advances in Methods and Practices in Psychological Science\.External Links:[Document](https://dx.doi.org/10.1177/25152459251343582)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p4.1)\.
- \[15\]L\. Bulla, S\. De Giorgis, M\. Mongiovì, and A\. Gangemi\(2025\)Large language models meet moral values: a comprehensive assessment of moral abilities\.Computers in Human Behavior Reports17,pp\. 100609\.External Links:[Document](https://dx.doi.org/10.1016/j.chbr.2025.100609)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p5.8),[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p10.1)\.
- \[16\]K\. Charmaz\(2014\)Constructing grounded theory\.2nd edition,Introducing Qualitative Methods,Sage,London\.External Links:ISBN 978\-0857029140Cited by:[§3\.1](https://arxiv.org/html/2606.28574#S3.SS1.p4.1)\.
- \[17\]S\. Chausson, M\. Fourcade, D\. J\. Harding, B\. Ross, and G\. Renard\(2026\)The insight\-inference loop: efficient text classification via natural language inference and threshold\-tuning\.Sociological Methods & Research55\(2\),pp\. 568–615\.External Links:[Document](https://dx.doi.org/10.1177/00491241251326819)Cited by:[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p6.1)\.
- \[18\]M\. T\. H\. Chi, P\. J\. Feltovich, and R\. Glaser\(1981\)Categorization and Representation of Physics Problems by Experts and Novices\.Cognitive Science5\(2\),pp\. 121–152\.External Links:[Document](https://dx.doi.org/10.1207/s15516709cog0502%5F2)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[19\]L\. J\. Cronbach and P\. E\. Meehl\(1955\)Construct validity in psychological tests\.Psychological Bulletin52\(4\),pp\. 281–302\.External Links:[Document](https://dx.doi.org/10.1037/h0040957)Cited by:[§1\.5](https://arxiv.org/html/2606.28574#S1.SS5.p1.1)\.
- \[20\]R\. Davis, H\. Shrobe, and P\. Szolovits\(1993\)What is a knowledge representation?\.AI Magazine14\(1\),pp\. 17–33\.Note:Five roles of knowledge representation: surrogate, ontological commitments, fragmentary theory of intelligent reasoning, medium for computation, medium of human expressionExternal Links:[Document](https://dx.doi.org/10.1609/aimag.v14i1.1029)Cited by:[§1\.7](https://arxiv.org/html/2606.28574#S1.SS7.p1.1)\.
- \[21\]D\. Demszky, D\. Yang, D\. S\. Yeager, C\. J\. Bryan, M\. Clapper, S\. Chandhok, J\. C\. Eichstaedt, C\. Hecht, J\. Jamieson, M\. Johnson, M\. Jones, D\. Krettek\-Cobb, L\. Lai, N\. JonesMitchell, D\. C\. Ong, C\. S\. Dweck, J\. J\. Gross, and J\. W\. Pennebaker\(2023\)Using large language models in psychology\.Nature Reviews Psychology2\(11\),pp\. 688–701\.External Links:[Document](https://dx.doi.org/10.1038/s44159-023-00241-5)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p4.1)\.
- \[22\]E\. Dubourg, V\. Thouzeau, and N\. Baumard\(2024\)A step\-by\-step method for cultural annotation by LLMs\.Frontiers in Artificial Intelligence7,pp\. 1365508\.Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[23\]Z\. O\. Dunivin\(2024\)Scalable qualitative coding with LLMs: chain\-of\-thought reasoning matches human performance in some hermeneutic tasks\.arXiv preprint arXiv:2401\.15170\.Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p3.6)\.
- \[24\]N\. Egami, M\. Hinck, B\. M\. Stewart, and H\. Wei\(2023\)Using imperfect surrogates for downstream inference: design\-based supervised learning for social science applications of large language models\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/d862f7f5445255090de13b825b880d59-Abstract-Conference.html)Cited by:[§3\.2](https://arxiv.org/html/2606.28574#S3.SS2.p10.1)\.
- \[25\]C\. Espinal Maya\(2026\)Measuring what cannot be surveyed: LLMs as instruments for latent cognitive variables in labor economics\.arXiv preprint arXiv:2604\.02403\.External Links:2604\.02403,[Document](https://dx.doi.org/10.48550/arXiv.2604.02403),[Link](https://arxiv.org/abs/2604.02403)Cited by:[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p4.1)\.
- \[26\]J\. R\. Firth\(1957\)A synopsis of linguistic theory, 1930–1955\.InStudies in Linguistic Analysis,pp\. 1–32\.Cited by:[§1\.3](https://arxiv.org/html/2606.28574#S1.SS3.p4.1)\.
- \[27\]T\. Freiesleben\(2026\)Establishing construct validity in LLM capability benchmarks requires nomological networks\.arXiv preprint arXiv:2603\.15121\.External Links:2603\.15121,[Document](https://dx.doi.org/10.48550/arXiv.2603.15121),[Link](https://arxiv.org/abs/2603.15121)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p2.1),[footnote 1](https://arxiv.org/html/2606.28574#footnote1)\.
- \[28\]R\. Geirhos, J\. Jacobsen, C\. Michaelis, R\. Zemel, W\. Brendel, M\. Bethge, and F\. A\. Wichmann\(2020\)Shortcut learning in deep neural networks\.Nature Machine Intelligence2\(11\),pp\. 665–673\.External Links:[Document](https://dx.doi.org/10.1038/s42256-020-00257-z)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p2.4)\.
- \[29\]D\. Gentner\(1983\)Structure\-Mapping: A Theoretical Framework for Analogy\.Cognitive Science7\(2\),pp\. 155–170\.External Links:[Document](https://dx.doi.org/10.1207/s15516709cog0702%5F3)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[30\]F\. Gilardi, M\. Alizadeh, and M\. Kubli\(2023\)ChatGPT outperforms crowd workers for text\-annotation tasks\.Proceedings of the National Academy of Sciences120\(30\),pp\. e2305016120\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2305016120)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p3.6)\.
- \[31\]A\. Goddard and A\. Gillespie\(2025\)The repeated adjustment of measurement protocols \(RAMP\) method for developing high\-validity text classifiers\.Psychological Methods\.Note:Advance online publicationExternal Links:[Document](https://dx.doi.org/10.1037/met0000787)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p4.1)\.
- \[32\]J\. Graham, J\. Haidt, S\. Koleva, M\. Motyl, R\. Iyer, S\. P\. Wojcik, and P\. H\. Ditto\(2013\)Moral foundations theory: The pragmatic validity of moral pluralism\.InAdvances in Experimental Social Psychology,P\. Devine and A\. Plant \(Eds\.\),Vol\.47,pp\. 55–130\.External Links:[Document](https://dx.doi.org/10.1016/B978-0-12-407236-7.00002-4)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p5.8)\.
- \[33\]A\. Halterman and K\. A\. Keith\(2025\)Codebook LLMs: evaluating LLMs as measurement tools for political science concepts\.Political Analysis\.Note:Online firstExternal Links:[Document](https://dx.doi.org/10.1017/pan.2025.10017)Cited by:[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1)\.
- \[34\]A\. Herderich, J\. Lasser, M\. Galesic, S\. T\. Aroyehun, D\. Garcia, and J\. Garland\(2025\)Measuring complex constructs in large\-scale text with computational social mixed methods\.PsyArXiv\.External Links:[Document](https://dx.doi.org/10.31234/osf.io/tzc9p),[Link](https://doi.org/10.31234/osf.io/tzc9p)Cited by:[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1)\.
- \[35\]C\. Hou, G\. Zhu, L\. Zheng, X\. Huang, T\. Zhong, H\. Li, H\. Du, and C\. L\. Ker\(2024\)Prompt\-based and fine\-tuned GPT models for context\-dependent and\-independent deductive coding in social annotation\.InProceedings of the 14th Learning Analytics and Knowledge Conference,pp\. 518–528\.Cited by:[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p5.1)\.
- \[36\]M\. T\. Kane\(2013\)Validating the interpretations and uses of test scores\.Journal of Educational Measurement50\(1\),pp\. 1–73\.External Links:[Document](https://dx.doi.org/10.1111/jedm.12000)Cited by:[footnote 1](https://arxiv.org/html/2606.28574#footnote1)\.
- \[37\]K\. W\. Kim, R\. Islamaj, J\. Kim, F\. Boudin, and A\. Aizawa\(2025\)Repurposing annotation guidelines to instruct LLM annotators: a case study\.InInternational Conference on Applications of Natural Language to Information Systems \(NLDB 2025\),Lecture Notes in Computer Science,pp\. 140–151\.Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[38\]Z\. Lin\(2025\)A validity\-guided workflow for robust large language model research in psychology\.arXiv preprint arXiv:2507\.04491\.External Links:[Document](https://dx.doi.org/10.31234/osf.io/xw98v)Cited by:[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1),[§2](https://arxiv.org/html/2606.28574#S2.p2.1)\.
- \[39\]Z\. Lin\(2025\)From prompts to constructs: a dual\-validity framework for LLM research in psychology\.arXiv preprint arXiv:2506\.16697\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2506.16697)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p2.1)\.
- \[40\]K\. MacCorquodale and P\. E\. Meehl\(1948\)On a distinction between hypothetical constructs and intervening variables\.Psychological Review55\(2\),pp\. 95–107\.External Links:[Document](https://dx.doi.org/10.1037/h0056029)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[41\]K\. Mahowald, A\. A\. Ivanova, I\. A\. Blank, N\. Kanwisher, J\. B\. Tenenbaum, and E\. Fedorenko\(2024\)Dissociating language and thought in large language models\.Trends in Cognitive Sciences28\(6\),pp\. 517–540\.External Links:[Document](https://dx.doi.org/10.1016/j.tics.2024.01.011),ISSN 1364\-6613,[Link](https://www.sciencedirect.com/science/article/pii/S1364661324000275)Cited by:[§1\.3](https://arxiv.org/html/2606.28574#S1.SS3.p4.1)\.
- \[42\]S\. C\. Matz, H\. Peters, M\. Cerf, E\. Grunenberg, P\. W\. Eastwick, M\. D\. Back, and E\. J\. Finkel\(2026\)Large language models can detect verbal indicators of romantic attraction\.Scientific Reports\.External Links:[Document](https://dx.doi.org/10.1038/s41598-026-52308-x)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p4.3),[§3](https://arxiv.org/html/2606.28574#S3.p1.1)\.
- \[43\]R\. T\. McCoy, E\. Pavlick, and T\. Linzen\(2019\)Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 3428–3448\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-1334)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p2.4),[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1)\.
- \[44\]S\. Messick\(1995\)Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning\.American Psychologist50\(9\),pp\. 741–749\.External Links:[Document](https://dx.doi.org/10.1037/0003-066X.50.9.741)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p1.1),[§3\.1](https://arxiv.org/html/2606.28574#S3.SS1.p2.1),[§3](https://arxiv.org/html/2606.28574#S3.p5.1)\.
- \[45\]S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer\(2022\)Rethinking the Role of Demonstrations: What makes in\-context learning work?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Abu Dhabi, United Arab Emirates,pp\. 11048–11064\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.759)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p2.4),[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1),[§1](https://arxiv.org/html/2606.28574#S1.p4.1)\.
- \[46\]M\. Mitchell\(2026\)Six principles for evaluating cognitive capabilities in AI models\.AI Magazine47\(2\),pp\. e70061\.External Links:[Document](https://dx.doi.org/10.1002/aaai.70061)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p2.1),[§2](https://arxiv.org/html/2606.28574#S2.p2.1),[footnote 2](https://arxiv.org/html/2606.28574#footnote2)\.
- \[47\]S\. Mohammadi, B\. H\. Vedula, H\. Lamba, E\. Raff, P\. Kumaraguru, F\. Ferraro, and M\. Gaur\(2025\)Do LLMs adhere to label definitions? Examining their receptivity to external label definitions\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 32380–32393\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1648)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[48\]N\. Pangakis, S\. Wolken, and N\. Fasching\(2023\)Automated annotation with generative AI requires validation\.arXiv preprint arXiv:2306\.00176\.Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p4.1)\.
- \[49\]E\. Pavlick\(2023\)Symbols and grounding in large language models\.Philosophical Transactions of the Royal Society A381\(2251\),pp\. 20220041\.External Links:[Document](https://dx.doi.org/10.1098/rsta.2022.0041),ISSN 1364\-503X,[Link](https://royalsocietypublishing.org/doi/10.1098/rsta.2022.0041)Cited by:[§1\.6](https://arxiv.org/html/2606.28574#S1.SS6.p1.1),[§1](https://arxiv.org/html/2606.28574#S1.p2.1)\.
- \[50\]S\. Rathje, D\. Mirea, I\. Sucholutsky, R\. Marjieh, C\. E\. Robertson, and J\. J\. Van Bavel\(2024\)GPT is an effective tool for multilingual psychological text analysis\.Proceedings of the National Academy of Sciences121\(34\),pp\. e2308950121\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2308950121)Cited by:[§1\.1](https://arxiv.org/html/2606.28574#S1.SS1.p5.8),[§1](https://arxiv.org/html/2606.28574#S1.p3.6),[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p10.1)\.
- \[51\]O\. Sainz, I\. García\-Ferrero, R\. Agerri, O\. Lopez de Lacalle, G\. Rigau, and E\. Agirre\(2024\)GoLLIE: annotation guidelines improve zero\-shot information\-extraction\.InProceedings of the Twelfth International Conference on Learning Representations \(ICLR 2024\),Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[52\]P\. Törnberg\(2025\)Large language models outperform expert coders and supervised classifiers at annotating political social media messages\.Social Science Computer Review43\(6\),pp\. 1181–1195\.External Links:[Document](https://dx.doi.org/10.1177/08944393241286471)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p3.6)\.
- \[53\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.Cited by:[§1\.4](https://arxiv.org/html/2606.28574#S1.SS4.p1.1)\.
- \[54\]Z\. Xiao, X\. Yuan, Q\. V\. Liao, R\. Abdelghani, and P\. Oudeyer\(2023\)Supporting qualitative analysis with large language models: combining codebook with GPT\-3 for deductive coding\.InCompanion Proceedings of the 28th International Conference on Intelligent User Interfaces \(IUI ’23\),pp\. 75–78\.External Links:[Document](https://dx.doi.org/10.1145/3581754.3584136)Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[55\]Z\. Xu, V\. Khatri, Y\. Dai, X\. Liu, S\. Li, X\. Zhang, and R\. Yu\(2026\)Enhancing LLM\-based data annotation with error decomposition\.InProceedings of the International Conference on Learning Analytics and Knowledge \(LAK ’26\),External Links:2601\.11920,[Document](https://dx.doi.org/10.1145/3785022.3785070),[Link](https://arxiv.org/abs/2601.11920)Cited by:[§2\.1](https://arxiv.org/html/2606.28574#S2.SS1.p8.1)\.
- \[56\]F\. Yin, J\. Vig, P\. Laban, S\. Joty, C\. Xiong, and C\. J\. Wu\(2023\)Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL 2023\),Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[57\]A\. Zamai, A\. Zugarini, L\. Rigutini, M\. Ernandes, and M\. Maggini\(2024\)Show less, instruct more: enriching prompts with definitions and guidelines for zero\-shot NER\.arXiv preprint arXiv:2407\.01272\.Cited by:[§2](https://arxiv.org/html/2606.28574#S2.p3.1)\.
- \[58\]C\. Ziems, W\. Held, O\. Shaikh, J\. Chen, Z\. Zhang, and D\. Yang\(2024\)Can large language models transform computational social science?\.Computational Linguistics50\(1\),pp\. 237–291\.External Links:[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00502)Cited by:[§1](https://arxiv.org/html/2606.28574#S1.p3.6)\.