Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

arXiv cs.CL Papers

Summary

This paper provides causal evidence that large language models acquire negative linguistic knowledge (what not to say) through statistical preemption, a mechanism from Construction Grammar, by showing that manipulating competing-form frequencies via fine-tuning shifts preemption behavior in predicted directions.

arXiv:2605.23039v1 Announce Type: new Abstract: How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*donated the library the books"). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.

Cached at: 05/25/26, 08:57 AM

# Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs
Source: [https://arxiv.org/html/2605.23039](https://arxiv.org/html/2605.23039)
Dongxin Guo (The University of Hong Kong, Hong Kong, China; bettyguo@connect.hku.hk), Jikun Wu (Stellaris AI Limited, Hong Kong, China; hk950014@connect.hku.hk), Siu Ming Yiu (The University of Hong Kong, Hong Kong, China; smyiu@cs.hku.hk)

###### Abstract

How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes *statistical preemption*: exposure to a conventional form (e.g., “*donated the books to the library*”) preempts structurally possible but unattested alternatives (“*\*donated the library the books*”). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb–construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.


## 1 Introduction

How do language learners come to know what *not* to say? A child who hears “*She donated the books to the library*” must eventually learn that “*\*She donated the library the books*” is unacceptable, despite never being told so and despite the double-object construction being perfectly productive with semantically similar verbs like *give* (Pinker, [1989](https://arxiv.org/html/2605.23039#bib.bib66); Gropen et al., [1989](https://arxiv.org/html/2605.23039#bib.bib39)). This “retreat from overgeneralization” constitutes Baker’s Paradox (Baker, [1979](https://arxiv.org/html/2605.23039#bib.bib76)), one of the deepest puzzles in language acquisition.

Construction Grammar offers an influential solution through statistical preemption: learners acquire negative knowledge by tracking the frequency of competing conventional forms (Goldberg, [2005](https://arxiv.org/html/2605.23039#bib.bib21), [2018](https://arxiv.org/html/2605.23039#bib.bib22)). When a speaker repeatedly encounters *donate* in the prepositional dative in contexts where the double-object would be functionally equivalent, this accumulated evidence “preempts” the unattested alternative (Goldberg, [2011](https://arxiv.org/html/2605.23039#bib.bib23), [2016](https://arxiv.org/html/2605.23039#bib.bib24)). Preemption differs from entrenchment, the proposal that hearing a verb frequently in *any* construction reduces willingness to use it in novel ones (Brooks and Tomasello, [1999](https://arxiv.org/html/2605.23039#bib.bib36); Ambridge, [2020](https://arxiv.org/html/2605.23039#bib.bib27)). Preemption requires exposure to a specific *competing form* with the same communicative function (Winkler et al., [2015](https://arxiv.org/html/2605.23039#bib.bib58); Samara et al., [2025](https://arxiv.org/html/2605.23039#bib.bib59)).

We ask whether LLMs, trained purely on distributional statistics without explicit grammatical instruction, capture a distributional signature of statistical preemption that parallels the one observed in human speakers of English. This question matters for three reasons: LLMs provide a controlled test of whether distributional learning alone suffices to acquire negative knowledge (Misra and Mahowald, [2024](https://arxiv.org/html/2605.23039#bib.bib52); Yao et al., [2025](https://arxiv.org/html/2605.23039#bib.bib71); Wonnacott et al., [2008](https://arxiv.org/html/2605.23039#bib.bib38)); per-verb LLM effects can be compared item-by-item to human behavioral data using validated linking hypotheses (Hu et al., [2024](https://arxiv.org/html/2605.23039#bib.bib69)); and controlled training interventions can provide *causal* evidence that preemption operates through distributional competition.

[Figure 1 diagram: Baker’s Paradox motivates two Construction Grammar accounts, preemption ($\Delta S \propto$ competing-form frequency) and entrenchment ($\Delta S \propto$ overall verb frequency); both are tested with correlational (Exps 1–3) and causal (Exp 4) evidence and validated against human behavioral ground truth.]

Figure 1: Conceptual framework. Baker’s Paradox motivates two competing accounts from Construction Grammar, namely preemption and entrenchment, which make distinct predictions about LLM surprisal patterns. We test these with correlational evidence (Experiments 1–3) and causal evidence from controlled training interventions (Experiment 4), triangulated against human behavioral data from three independent sources.

Following recent work using LLMs as instruments for testing scientific hypotheses about language (McCoy et al., [2024](https://arxiv.org/html/2605.23039#bib.bib68); Futrell et al., [2019](https://arxiv.org/html/2605.23039#bib.bib49); Wilcox et al., [2023](https://arxiv.org/html/2605.23039#bib.bib2); Baroni, [2022](https://arxiv.org/html/2605.23039#bib.bib51)), we design four experiments: (1) testing whether LLM surprisal distinguishes preempted from non-preempted English forms and correlates with human acceptability; (2) dissociating preemption from entrenchment using Winkler et al.’s ([2015](https://arxiv.org/html/2605.23039#bib.bib58)) +Competing/−Competing design with non-circular partial correlations against human data; (3) fitting a formal scaling law across 14 models; and (4) providing converging causal evidence via fine-tuning with manipulated frequencies, replicated across five seeds with a reverse-direction control.
We build on a growing body of causal work using controlled-input training in this area (Misra and Mahowald, [2024](https://arxiv.org/html/2605.23039#bib.bib52); Misra and Kim, [2024](https://arxiv.org/html/2605.23039#bib.bib70); Yao et al., [2025](https://arxiv.org/html/2605.23039#bib.bib71)); our specific methodological contribution is the combination of preemption–entrenchment dissociation in LLMs with non-circular human-data validation ($r_{\text{partial}} = 0.58$), strong item-level LLM–human correlations ($r = 0.79$) triangulated across three behavioral datasets, a formal scaling analysis, and a reverse-direction asymmetry control, with converging evidence across dative, causative, and locative alternations in English.

## 2 Background

### 2.1 Statistical Preemption Theory

Within the broader framework of Construction Grammar (Goldberg, [1995](https://arxiv.org/html/2605.23039#bib.bib20), [2005](https://arxiv.org/html/2605.23039#bib.bib21); Bybee, [2010](https://arxiv.org/html/2605.23039#bib.bib30); Diessel, [2019](https://arxiv.org/html/2605.23039#bib.bib31)), statistical preemption was formalized by Goldberg ([2011](https://arxiv.org/html/2605.23039#bib.bib23)) as follows: a verb $v$ is preempted from construction $A$ when speakers have accumulated sufficient evidence that a competing construction $B$ is conventionally used with $v$ in the same communicative contexts. Formally:

$$P(\text{Cx}_{B} \mid \text{context for Cx}_{A},\; v) \gg 0 \tag{1}$$

Goldberg ([2018](https://arxiv.org/html/2605.23039#bib.bib22)) situated this mechanism within a broader framework balancing *coverage* (the pressure to extend constructions productively) against *competition* (the inhibitory force of established alternatives). The key prediction is *gradient*: unacceptability varies continuously with the strength of competing evidence (Goldberg, [2016](https://arxiv.org/html/2605.23039#bib.bib24); Bresnan and Ford, [2010](https://arxiv.org/html/2605.23039#bib.bib33); Barak and Goldberg, [2017](https://arxiv.org/html/2605.23039#bib.bib19)).

The behavioral evidence for preemption has grown substantially. Boyd et al. ([2009](https://arxiv.org/html/2605.23039#bib.bib26)) first demonstrated the mechanism for a-adjective production. Winkler et al. ([2015](https://arxiv.org/html/2605.23039#bib.bib58)) provided the crucial dissociation: for verbs *with* a competing form, competing-form frequency predicted unacceptability; for verbs *without* a competing form, this effect vanished. Tachihara and Goldberg ([2025](https://arxiv.org/html/2605.23039#bib.bib61)) provided the first *causal* evidence in humans. Samara et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib59)), using five artificial-language-learning studies, found all five supported preemption while three showed null entrenchment effects. Wonnacott et al. ([2008](https://arxiv.org/html/2605.23039#bib.bib38)) demonstrated distributional learning of argument structure. Bayesian approaches (Perfors et al., [2011](https://arxiv.org/html/2605.23039#bib.bib37)) show that negative knowledge can arise from Bayesian inference over positive data.

The preemption-entrenchment debate has been nuanced. Ambridge et al. ([2015](https://arxiv.org/html/2605.23039#bib.bib62)) found evidence for entrenchment but not preemption, though with high collinearity. Ambridge et al. ([2018](https://arxiv.org/html/2605.23039#bib.bib63)) showed both effects are reliable but rarely separable. Stefanowitsch ([2006](https://arxiv.org/html/2605.23039#bib.bib65)) argued that negative evidence from corpus frequencies constrains overgeneralization. Our computational approach offers a complementary resolution: by using models for which the training distribution is fully known, we can directly compute preemption and entrenchment scores and assess their independent contributions (Perek, [2015](https://arxiv.org/html/2605.23039#bib.bib29), [2014](https://arxiv.org/html/2605.23039#bib.bib28); Ellis, [2002](https://arxiv.org/html/2605.23039#bib.bib32)).

### 2.2 The Dative Alternation as a Test Case

The English dative alternation is ideal for testing preemption because it involves many verbs distributed across the full range of alternation behavior. Many verbs alternate freely, but others are restricted: *donate*, *explain*, and *whisper* resist the double-object, while *cost* and *fine* resist the prepositional (Levin, [1993](https://arxiv.org/html/2605.23039#bib.bib40); Hovav and Levin, [2008](https://arxiv.org/html/2605.23039#bib.bib41); Gropen et al., [1989](https://arxiv.org/html/2605.23039#bib.bib39)). Bresnan et al. ([2007](https://arxiv.org/html/2605.23039#bib.bib42)) modeled the alternation as a probabilistic choice governed by approximately ten factors including animacy, pronominality, and given/new status. Hawkins et al. ([2020](https://arxiv.org/html/2605.23039#bib.bib67)) created the DAIS benchmark, comprising 50,000 forced-choice human judgments across 200 dative verbs: the gradient, item-level human data essential for testing preemption.

### 2.3 Surprisal as a Linking Hypothesis

We use word-level surprisal, $S(w_{t}) = -\log_{2} P(w_{t} \mid w_{<t})$, as our linking hypothesis between LLM representations and human acceptability (Hale, [2001](https://arxiv.org/html/2605.23039#bib.bib1); Levy, [2008](https://arxiv.org/html/2605.23039#bib.bib43)). This relationship is well-established across six orders of magnitude (Smith and Levy, [2013](https://arxiv.org/html/2605.23039#bib.bib44); Shain et al., [2024](https://arxiv.org/html/2605.23039#bib.bib45)), with better models yielding better cognitive predictors (Goodkind and Bicknell, [2018](https://arxiv.org/html/2605.23039#bib.bib46); Michaelov et al., [2024](https://arxiv.org/html/2605.23039#bib.bib47)). Hu et al. ([2024](https://arxiv.org/html/2605.23039#bib.bib69)) further demonstrated that minimal-pair surprisal differences predict *item-level* variation in human grammaticality judgments.
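The surprisal definition above can be illustrated with a minimal sketch. It assumes word-level conditional probabilities are already available (a real model assigns probabilities to subword tokens, which would first have to be combined per word). The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal in bits of a word with conditional probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must lie in (0, 1]")
    return -math.log2(prob)


def mean_surprisal(word_probs: list[float]) -> float:
    """Mean per-word surprisal of a sentence, given P(w_t | w_<t) for each word."""
    return sum(surprisal_bits(p) for p in word_probs) / len(word_probs)


# Hypothetical probabilities for a three-word sentence (illustration only).
probs = [0.5, 0.25, 0.125]
print(mean_surprisal(probs))  # (1 + 2 + 3) / 3 = 2.0
```

Lower-probability words contribute more bits, so a sentence the model finds unexpected has a higher mean surprisal.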

### 2.4 LLMs and Construction Grammar

Li et al. ([2022](https://arxiv.org/html/2605.23039#bib.bib3)) found that sentences sharing a construction are closer in LLM embedding space. Scivetti and Schneider ([2025](https://arxiv.org/html/2605.23039#bib.bib56)) showed BERT captures construction-level distinctions. Misra and Mahowald ([2024](https://arxiv.org/html/2605.23039#bib.bib52)) demonstrated that LMs learn rare constructions from distributional evidence. Most directly, Yao et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib71)) showed that dative preferences in LMs are shaped by indirect statistical patterns and validated this against human judgments (their Fig. 3). Misra and Kim ([2024](https://arxiv.org/html/2605.23039#bib.bib70)) treated cross-dative generalization as a case of statistical preemption, training models on child-directed speech to test when generalization (vs. preemption) is likely for novel verbs; their controlled-input training, like our fine-tuning intervention, constitutes a causal manipulation of distributional inputs. Rather than claiming priority for “causal” evidence, our study complements this growing line of work by combining four features in a single design: (1) directly dissociating preemption from entrenchment via the +Competing/−Competing scheme (Weissweiler et al., [2023](https://arxiv.org/html/2605.23039#bib.bib4)); (2) non-circular partial correlations against human data; (3) a reverse-direction control that diagnoses the asymmetry predicted by preemption theory; and (4) multi-construction scope (dative, causative, locative). Table [1](https://arxiv.org/html/2605.23039#S2.T1) situates these contributions relative to the most closely related studies.

Table 1: Feature comparison with most relevant prior work. (a) Yao et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib71)) report human judgment comparisons (their Fig. 3) but do not directly dissociate preemption from entrenchment. (b) Misra and Mahowald ([2024](https://arxiv.org/html/2605.23039#bib.bib52)) and the closely related Misra and Kim ([2024](https://arxiv.org/html/2605.23039#bib.bib70)) both train/fine-tune LMs with controlled inputs, an approach as causal as ours; we cite both. Our methodological contribution is the combination of multi-construction scope, formal scaling analysis, non-circular human-data validation, and a reverse-direction control in a single causal design.

## 3 Experimental Design

### 3.1 Stimulus Materials

We constructed 120 verb–construction items organized along two dimensions: preemption strength (strong, weak, none) and construction type (dative, causative, locative). The full stimulus set is in Appendix [A](https://arxiv.org/html/2605.23039#A1).

Dative verbs (80 items). Selected from Levin ([1993](https://arxiv.org/html/2605.23039#bib.bib40)) and Hawkins et al. ([2020](https://arxiv.org/html/2605.23039#bib.bib67)), classified *a priori* by corpus-based preemption strength computed from an independent held-out sample of the British National Corpus *before any model was run*: *Strong* preemption (27 verbs; ≥ 80% in one frame, e.g., *donate, explain, whisper*), *Weak* preemption (26 verbs; 55–79%, e.g., *ship, toss, carry*), and *No* preemption (27 verbs; near-equal alternation, e.g., *give, send, offer*).

Causative verbs (20 items) and Locative verbs (20 items) were adapted from Ambridge et al. ([2008](https://arxiv.org/html/2605.23039#bib.bib64)) and Winkler et al. ([2015](https://arxiv.org/html/2605.23039#bib.bib58)), spanning the preemption continuum for each alternation. For each verb, we created matched sentence pairs controlling for sentence length (±2 words), subject animacy, object definiteness, and tense. Each verb appeared in 5 distinct sentence frames (Appendix [B](https://arxiv.org/html/2605.23039#A2)).

### 3.2 Human Behavioral Data

We use three independent sources of human behavioral ground truth. DAIS (Hawkins et al., [2020](https://arxiv.org/html/2605.23039#bib.bib67)): per-verb bias scores from 50,000 forced-choice judgments across 200 dative verbs collected from 500 participants; all 80 of our dative verbs are present. Alignment with human datasets is at the verb level: DAIS provides per-verb bias scores averaged across multiple sentence frames, and our per-verb $\Delta S$ values are similarly averaged across 5 frames, ensuring both measures reflect stable, frame-independent verb preferences. Robenalt & Goldberg (Winkler et al., [2015](https://arxiv.org/html/2605.23039#bib.bib58)): Likert-scale ratings (1–7) for 24 causative verbs from 108 participants. Tachihara & Goldberg (Tachihara and Goldberg, [2020](https://arxiv.org/html/2605.23039#bib.bib60), [2025](https://arxiv.org/html/2605.23039#bib.bib61)): graded acceptability data for dative pairings from both L1 and L2 English speakers, complementing Experiment 4 by providing human causal evidence.¹

¹ Scope of human comparisons. Our human-judgment comparisons cover the dative (DAIS, T&G) and causative (R&G) alternations. No comparably large item-level behavioral dataset exists for the locative alternation in English; locative results are therefore evaluated against the corpus-based classification and effect-size predictions from Ambridge et al. ([2008](https://arxiv.org/html/2605.23039#bib.bib64)), not against direct item-level human judgments. We flag this as a limitation.

² Conventional vs. unconventional. “Conventional” denotes the construction in which a verb is attested at higher frequency in our corpus (e.g., *donate* in the prepositional dative); “unconventional” denotes the lower-frequency alternative. The terminology is descriptive (about empirical distribution), not normative (about grammaticality). For verbs in the *No preemption* category, both forms are conventional; we adopt the alphabetically first form as “conventional” for the purpose of setting the sign of $\Delta S$.

### 3.3 Language Models

We evaluate 14 models from four families: GPT-2 (124M, 355M, 774M, 1.5B), Pythia (160M, 410M, 1B, 2.8B, 6.9B, 12B), LLaMA-2 (7B, 13B, 70B), and OLMo (7B). All are base (non-instruction-tuned) models, since instruction tuning can shift surprisal patterns in ways that complicate psycholinguistic interpretation. Pythia provides controlled scaling (same architecture and data across sizes). OLMo enables direct training-data verification via the public Dolma corpus.

### 3.4 Measures

Surprisal differential ($\Delta S$). For each verb $v$:

$$\Delta S(v) = \bar{S}(\text{unconventional}) - \bar{S}(\text{conventional}) \tag{2}$$

where $\bar{S}(\cdot)$ denotes mean per-word surprisal averaged across 5 sentence frames.
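As a rough sketch of Equation 2, the function below averages per-frame mean surprisals for each form and takes the difference. The function name and the per-frame values are hypothetical; the only details taken from the text are the averaging over 5 frames and the sign convention (unconventional minus conventional).

```python
from statistics import mean


def delta_s(unconventional: list[float], conventional: list[float]) -> float:
    """Surprisal differential for one verb.

    Each list holds the mean per-word surprisal (bits/word) of the verb's
    sentence in each frame; the result is positive when the unconventional
    form is more surprising to the model.
    """
    return mean(unconventional) - mean(conventional)


# Hypothetical per-frame surprisals for one verb across 5 frames.
unconv = [4.0, 4.5, 5.0, 4.5, 4.5]
conv = [2.0, 2.5, 2.0, 2.5, 3.5]
print(delta_s(unconv, conv))  # 4.5 - 2.5 = 2.0
```

A value near zero means the model treats both frames as about equally expected, which is the pattern predicted for freely alternating verbs.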

Preemption score\.FollowingGoldberg \([2011](https://arxiv.org/html/2605.23039#bib.bib23)\):

Preempt​\(v\)=f​\(v,Cxconv\)\+1f​\(v,Cxconv\)\+f​\(v,Cxunconv\)\+2\\textsc\{Preempt\}\(v\)=\\frac\{f\(v,\\text\{Cx\}\_\{\\text\{conv\}\}\)\+1\}\{f\(v,\\text\{Cx\}\_\{\\text\{conv\}\}\)\+f\(v,\\text\{Cx\}\_\{\\text\{unconv\}\}\)\+2\}\(3\)wheref​\(v,Cx\)f\(v,\\text\{Cx\}\)is the frequency of verbvvin construction Cx, with Laplace smoothing\.Preempt\(vv\) is the corpus\-attested probability that verbvvoccurs in its conventional alternative; high values indicate that the conventional form dominates the verb’s distribution, which is the empirical condition under which preemption theory predicts the unconventional alternative becomes inaccessible\. We note that this operationalization is a*distributional proxy*for Goldberg’s theoretical concept of functional competition; we discuss this gap in §[8\.2](https://arxiv.org/html/2605.23039#S8.SS2)and partially address through control analyses\.
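Equation 3 is simple enough to sketch directly. The function name and the counts below are hypothetical; only the add-one (Laplace) smoothing in the numerator and the add-two in the denominator come from the formula.

```python
def preempt_score(f_conv: int, f_unconv: int) -> float:
    """Laplace-smoothed share of a verb's uses in its conventional construction."""
    if f_conv < 0 or f_unconv < 0:
        raise ValueError("frequencies must be non-negative")
    return (f_conv + 1) / (f_conv + f_unconv + 2)


# Hypothetical corpus counts (illustration only).
print(preempt_score(0, 0))    # unseen verb: 1 / 2 = 0.5
print(preempt_score(48, 48))  # balanced alternation: 49 / 98 = 0.5
print(preempt_score(98, 0))   # conventional form only: 99 / 100 = 0.99
```

The smoothing keeps the score defined for unseen verbs and prevents a verb attested in only one frame from reaching exactly 1.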

Entrenchment score. Total log verb frequency: $\textsc{Entrench}(v) = \log \sum_{\text{Cx}} f(v, \text{Cx})$. This deliberately simplified measure tracks the proposal that cumulative exposure to the verb in *any* context blocks use in novel constructions (Brooks and Tomasello, [1999](https://arxiv.org/html/2605.23039#bib.bib36); Ambridge, [2020](https://arxiv.org/html/2605.23039#bib.bib27)). Our operationalization follows the most directly comparable preemption–entrenchment literature (Ambridge et al., [2015](https://arxiv.org/html/2605.23039#bib.bib62), [2018](https://arxiv.org/html/2605.23039#bib.bib63)); richer formulations (e.g., construction-aggregated frequency) are sensitivity-tested in Appendix [P](https://arxiv.org/html/2605.23039#A16).
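A minimal sketch of the entrenchment score follows. The text does not state the base of the logarithm, so the natural log used here is an assumption, and the function name, construction labels, and counts are hypothetical.

```python
import math


def entrench_score(freq_by_construction: dict[str, int]) -> float:
    """Log of a verb's total frequency summed over all constructions.

    Assumption: natural logarithm; the base is not specified in the text.
    """
    total = sum(freq_by_construction.values())
    if total <= 0:
        raise ValueError("verb must be attested at least once")
    return math.log(total)


# Two hypothetical verbs with the same total frequency but different splits.
skewed = {"prepositional": 950, "double_object": 50}
balanced = {"prepositional": 500, "double_object": 500}
print(entrench_score(skewed) == entrench_score(balanced))  # True
```

Because the score depends only on the total, the two verbs are indistinguishable under entrenchment even though their construction splits differ sharply, which is what lets the two accounts be pulled apart.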

Corpus parsing pipeline. Preempt($v$) and Entrench($v$) are estimated by parsing each model’s training corpus (Dolma for OLMo; the Pile as a proxy for other models, $r = 0.94$). We use spaCy dependency parses with construction-specific lexico-syntactic templates: prepositional vs. double-object datives via dobj/prep/dative patterns; transitive vs. intransitive causatives via the presence of a dobj and matrix-verb morphology; content- vs. container-locatives via PP-head identity (*into/onto* vs. *with*). Three filtering layers (POS-tag agreement, strict dependency-pattern matching, and a preposition-lemma whitelist) mitigate noise in web text. Pipeline precision, validated against manually annotated sentences, is in the 92–96% range across constructions (Cohen’s $\kappa \in [0.89, 0.94]$). Full templates, noise-mitigation steps, and sensitivity analyses appear in Appendix [G](https://arxiv.org/html/2605.23039#A7).

### 3.5 Statistical Analysis

We use four complementary approaches: (1) paired $t$-tests with Cohen’s $d$, emphasizing effect sizes over $p$-values following methodological best practices (Lakens, [2013](https://arxiv.org/html/2605.23039#bib.bib53)); (2) Pearson correlations between $\Delta S$ and human acceptability, reported with 95% bootstrap confidence intervals (10,000 resamples); (3) mixed-effects regressions with random intercepts and slopes for model; and (4) partial correlations. All $p$-values are FDR-corrected (Benjamini and Hochberg, [2018](https://arxiv.org/html/2605.23039#bib.bib57)) across the full set of reported tests ($k = 94$).
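The FDR correction mentioned above can be sketched with the standard Benjamini-Hochberg step-up procedure. This is a textbook implementation under the assumption that the paper uses the standard procedure; the function name and the example $p$-values are hypothetical.

```python
def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Each adjusted value is compared against the chosen FDR level directly, so a result reported as significant after correction has an adjusted value below that level.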

## 4 Experiment 1: Preemption Effects

### 4.1 Predictions

If LLMs capture the distributional signature of statistical preemption in English, we predict: (H1a) $\Delta S$ will be graded across preemption categories (strong > weak > none); and (H1b) per-verb $\Delta S$ values will correlate with human gradient acceptability at the item level.

### 4.2 Results: Group Differences

All 14 models show the predicted graded pattern. Table [2](https://arxiv.org/html/2605.23039#S4.T2) reports results for representative models. For LLaMA-2 7B, strongly preempted dative verbs yield $\Delta S = 2.41$ bits/word (SD $= 0.89$), weakly preempted verbs yield $\Delta S = 1.12$ (SD $= 0.72$), and non-preempted verbs yield $\Delta S = 0.33$ (SD $= 0.51$). The difference between strong and no preemption is highly significant ($t(52) = 9.87$, $p < .001$, $d = 2.87$). Complete results for all 14 models and all three constructions are in Appendix [C](https://arxiv.org/html/2605.23039#A3).
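As a consistency check on the reported effect size, the sketch below recomputes Cohen's $d$ from the means and SDs above. It assumes the pooled-SD form for equal group sizes (27 strong and 27 non-preempted verbs); the paper does not say which variant of $d$ it uses, so this is an assumption, and the function name is hypothetical.

```python
import math


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Cohen's d using the pooled SD of two equal-sized groups."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Reported LLaMA-2 7B values: strong vs. no preemption (bits/word).
d = cohens_d(2.41, 0.89, 0.33, 0.51)
print(round(d, 2))  # 2.87
```

Under that assumption the recomputed value matches the reported $d = 2.87$.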

Table 2: Mean $\Delta S$ (bits/word) for dative verbs by preemption strength. $d_{\text{S-N}}$ = Cohen’s $d$ for strong vs. none. All $p < .001$ (FDR-corrected).

### 4.3 Results: Correlation with Human Judgments

Per-verb $\Delta S$ values correlate strongly with human acceptability from the DAIS benchmark (Table [3](https://arxiv.org/html/2605.23039#S4.T3)). LLaMA-2 7B achieves $r = 0.79$ [0.69, 0.86] ($p < .001$, $n = 80$ verbs), and all models above 1B parameters exceed $r = 0.70$. Correlations with Robenalt & Goldberg are $r = 0.74$ [0.55, 0.86] (24 verbs). Correlations with Tachihara & Goldberg ($r = 0.76$ [0.64, 0.85] for LLaMA-2 7B) provide independent replication. The consistency across three independently collected datasets, each using different judgment tasks, substantially increases confidence that the LLM-human correspondence reflects genuine shared sensitivity.

Table 3: Pearson correlations between $\Delta S$ and human acceptability. CIs from 10,000 bootstrap resamples. All $p < .001$.

### 4.4 Cross-Construction Generalization

The preemption pattern extends beyond the dative. For causative verbs (LLaMA-2 7B), strongly preempted items yield $\Delta S = 2.17$ versus $\Delta S = 0.32$ for non-preempted items ($d = 2.34$). For locative verbs, the effect is more modest ($d = 1.42$), consistent with weaker preemption effects in human data (Ambridge et al., [2008](https://arxiv.org/html/2605.23039#bib.bib64)). The effect-size ordering (dative $d = 2.87$ > causative $d = 2.34$ > locative $d = 1.42$) is identical in LLMs and humans ($p < .0001$, permutation test; held-out replication: $r_{\text{test}} = 0.77$; Appendix [E](https://arxiv.org/html/2605.23039#A5)).

## 5 Experiment 2: Preemption vs. Entrenchment

### 5.1 Design

The critical question is whether the observed effects reflect true preemption or merely entrenchment, i.e., whether a verb resists an unconventional frame because a *specific competing construction* is conventionally used in its place (preemption), or simply because the verb is heavily used in any construction at all (entrenchment). To dissociate these, we adapt [Winkler et al.](https://arxiv.org/html/2605.23039#bib.bib58)’s ([2015](https://arxiv.org/html/2605.23039#bib.bib58)) +Competing/−Competing design.

A verb is classified as +Competing if our corpus annotation identifies a single conventional alternative construction that accounts for the dominant share (≥ 60%) of its uses with the relevant semantic role configuration, and this alternative is functionally equivalent to the unconventional frame for the same communicative context (e.g., *donate* in the prepositional dative). A verb is classified as −Competing if no single construction dominates (≤ 45% in any one frame), or if no clearly functionally-equivalent alternative exists (e.g., *swim* in causative use lacks a tight periphrastic competitor; meaning is expressed by varied paraphrastic strategies). The key property of −Competing verbs is that even if they are frequent overall (allowing entrenchment to apply), they lack the single competing alternative preemption theory requires.
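The classification rule above can be sketched as a small decision function. The 60% and 45% thresholds come from the text; everything else is an assumption made for illustration: the function name, the frame labels, the boolean flag standing in for the functional-equivalence annotation, and the "unclassified" outcome for verbs whose top share falls between the two thresholds, a case the text does not address.

```python
def classify_competition(frame_shares: dict[str, float],
                         has_equivalent_alternative: bool) -> str:
    """Label a verb '+Competing', '-Competing', or 'unclassified'.

    frame_shares maps each construction to its share of the verb's uses
    (values sum to roughly 1.0).
    """
    top_share = max(frame_shares.values())
    # No functionally equivalent competitor, or no dominant frame.
    if not has_equivalent_alternative or top_share <= 0.45:
        return "-Competing"
    # A single functionally equivalent frame takes the dominant share.
    if top_share >= 0.60:
        return "+Competing"
    # Assumption: shares between the thresholds are left out of both groups.
    return "unclassified"


# Hypothetical share profiles.
print(classify_competition({"prep_dative": 0.9, "double_object": 0.1}, True))
print(classify_competition({"a": 0.40, "b": 0.35, "c": 0.25}, True))
print(classify_competition({"intransitive": 0.9, "causative": 0.1}, False))
```

The third call shows the case described for *swim*: one frame dominates, yet the verb is still −Competing because no functionally equivalent competitor exists.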

We selected 20 +Competing and 20 −Competing verbs, matched on five potential confounds: log overall frequency, Levin verb-class entropy, morphological complexity, register distribution, and concreteness. No matched variable differs significantly between groups (all $p > .20$; Table [9](https://arxiv.org/html/2605.23039#A6.T9) in Appendix [F](https://arxiv.org/html/2605.23039#A6)). The logical structure: if LLM surprisal reflects preemption, $\Delta S$ should be substantially larger for +Competing verbs even though both groups are frequency-matched; if entrenchment alone drove the effect, the two groups should behave similarly.

### 5.2 Results

Table [4](https://arxiv.org/html/2605.23039#S5.T4) shows results decisively supporting preemption. For LLaMA-2 7B, +Competing verbs yield $\Delta S = 2.36$ (SD $= 0.84$) versus $\Delta S = 0.91$ (SD $= 0.68$) for −Competing verbs ($t(38) = 6.02$, $p < .001$, $d = 1.91$). The effect is robust across all models, with $d$ ranging from 1.43 to 2.18.

Table 4: Mean $\Delta S$ for frequency-matched verbs with (+Comp.) vs. without (−Comp.) a competing conventional alternative. All $p < .001$ (FDR-corrected).

[Figure 2 plot: $\Delta S$ (bits/word, y-axis, 0 to 3) against Preempt($v$) (x-axis, 0.3 to 1.0), with separate markers for +Competing and −Competing verbs.]

Figure 2: Preemption–entrenchment dissociation (LLaMA-2 7B). +Competing verbs (blue, filled) show a strong relationship between Preempt($v$) and $\Delta S$; −Competing verbs (red, open) cluster near zero regardless of frequency. This dissociation is the key result: verb restrictions track the frequency of *competing* conventional forms, not overall verb frequency.

### 5.3 Regression Analysis

We fit mixed-effects regression models predicting $\Delta S$ from (1) Preempt($v$), (2) Entrench($v$), and (3) their interaction, with random intercepts and random slopes for Preempt by model. The preemption score is the dominant predictor ($\beta = 3.41$, SE $= 0.31$, $t = 11.0$, $p < .001$), while entrenchment contributes modestly ($\beta = 0.19$, SE $= 0.06$, $t = 3.17$, $p = .003$). The interaction is not significant ($p = .41$). Partial correlations confirm the asymmetry: controlling for entrenchment, preemption explains substantial variance ($r_{\text{partial}} = 0.72$, $p < .001$); controlling for preemption, entrenchment explains little ($r_{\text{partial}} = 0.24$, $p = .03$). Marginal $R^{2} = 0.68$; conditional $R^{2} = 0.74$. Full diagnostics (VIF $= 1.34$; Shapiro-Wilk $W = 0.987$, $p = .12$) in Appendix [D](https://arxiv.org/html/2605.23039#A4).
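For readers unfamiliar with partial correlations, the sketch below implements the textbook first-order formula, $r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}$. This illustrates the general technique and is not necessarily the exact procedure used in the paper; the function names and the synthetic data are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation between x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Synthetic data: y = x + z, with x and z uncorrelated by construction.
x = [1.0, 2.0, 3.0, 4.0]
z = [1.0, -1.0, -1.0, 1.0]
y = [a + b for a, b in zip(x, z)]
print(round(pearson(x, y), 3))          # raw correlation is below 1
print(round(partial_corr(x, y, z), 6))  # 1.0 once z is controlled
```

In the synthetic example the raw correlation understates the link between `x` and `y` because `z` adds unrelated variance; controlling for `z` recovers it, which is the logic behind reporting preemption with entrenchment partialled out and vice versa.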

### 5.4 Non-Circular Test: Triangulating LLM and Human Data

A legitimate concern is that both Preempt($v$) and $\Delta S$ are functions of corpus statistics, making the corpus-model regression partially circular. We address this through a *triangulated* test that combines two complementary partial correlations against human acceptability:

(i) LLM-level (within §[5.1](https://arxiv.org/html/2605.23039#S5.SS1) regression). Controlling for $\text{Entrench}(v)$, $\text{Preempt}(v)$ predicts $\Delta S$ with $r_{\text{partial}} = 0.72$; controlling for $\text{Preempt}(v)$, $\text{Entrench}(v)$ predicts $\Delta S$ with only $r_{\text{partial}} = 0.24$. Within LLM surprisal, preemption rather than entrenchment dominates.

(ii) Corpus-to-human (model-independent). Controlling for $\text{Entrench}(v)$, corpus-derived $\text{Preempt}(v)$ predicts DAIS human ratings with $r_{\text{partial}} = 0.58$ [0.42, 0.71] ($p < .001$). The reverse, namely $\text{Entrench}(v)$ controlling for $\text{Preempt}(v)$, yields only $r_{\text{partial}} = 0.12$ [−0.10, 0.33] ($p = .27$). The same pattern holds for R&G ratings ($r_{\text{partial}} = 0.52$, $p = .009$; entrenchment $r_{\text{partial}} = 0.08$, $p = .71$).

Together these decompose the LLM–human correspondence ($r = 0.79$) into two non-circular sub-links, each empirically asymmetric in favor of preemption: test (i) shows the LLM tracks the corpus distinction; test (ii) shows the same corpus distinction tracks human behavior with the LLM removed entirely. Three further controls (raw-frequency, n-gram, primacy-of-human-data) all converge on the same conclusion (Appendix [D.3](https://arxiv.org/html/2605.23039#A4.SS3)).

## 6 Experiment 3: Scaling Behavior

Using the Pythia suite (160M–12B), we track preemption sensitivity as a function of model parameters. Table [5](https://arxiv.org/html/2605.23039#S6.T5) shows monotonic, continuous improvement. Fitting a power law:

$$r(N) = a \cdot N^{b} + c \tag{4}$$

yields $b = 0.092$ [0.071, 0.113], adjusted $R^{2} = 0.993$. The sublinear exponent indicates diminishing returns, consistent with power-law scaling (Kaplan et al., [2020](https://arxiv.org/html/2605.23039#bib.bib7); Hoffmann et al., [2022](https://arxiv.org/html/2605.23039#bib.bib8)). There is no sudden phase transition, consistent with [Schaeffer et al.](https://arxiv.org/html/2605.23039#bib.bib6)’s ([2023](https://arxiv.org/html/2605.23039#bib.bib6)) finding that apparent emergent abilities (Wei et al., [2022](https://arxiv.org/html/2605.23039#bib.bib5)) are often metric artifacts. Cross-architecture comparison at 7B confirms generalizability (Figure [3](https://arxiv.org/html/2605.23039#S6.F3)). Alternative functional forms (log-linear, power law without intercept) yield worse fits (Appendix [I](https://arxiv.org/html/2605.23039#A9)).
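
One simple way to fit the three-parameter form $r(N) = a \cdot N^{b} + c$ is to grid-search the exponent $b$ and solve for $a$ and $c$ in closed form at each candidate, since the model is linear in $a$ and $c$ once $b$ is fixed. The sketch below shows this on noise-free synthetic scores generated from arbitrary values of $a$ and $c$; it is not the paper's fitting procedure or data, and the function name, grid, and parameter values are assumptions.

```python
def fit_power_law(sizes, scores, b_grid):
    """Fit score = a * N**b + c: grid search over b, closed-form OLS for a and c."""
    best = None
    n = len(sizes)
    for b in b_grid:
        x = [s**b for s in sizes]
        mx, my = sum(x) / n, sum(scores) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, scores))
        a = sxy / sxx
        c = my - a * mx
        sse = sum((a * xi + c - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


# Pythia parameter counts in the 160M-12B range.
sizes = [160e6, 410e6, 1e9, 1.4e9, 2.8e9, 6.9e9, 12e9]
# Synthetic, noise-free scores from arbitrary a and c (NOT values from the paper).
true_a, true_b, true_c = 0.15, 0.092, -0.5
scores = [true_a * s**true_b + true_c for s in sizes]

b_grid = [k / 1000 for k in range(10, 301)]  # candidate exponents 0.010 .. 0.300
a_hat, b_hat, c_hat = fit_power_law(sizes, scores, b_grid)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, c = {c_hat:.3f}")
```

With real, noisy correlations the recovered exponent would carry uncertainty, which is why an interval on $b$ (for example from bootstrap resampling) is the more informative quantity to report.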

Table 5: Scaling in Pythia. PPL = Wikitext-103 perplexity. CIs from bootstrap.

[Figure 3 plot: x-axis Parameters (log scale), 100M to 100B; y-axis $r$ (DAIS), 0.4 to 0.9; series: power-law fit, Pythia, LLaMA-2 7B, OLMo 7B, LLaMA-2 70B.]

Figure 3: Scaling of preemption sensitivity. Blue line: power-law fit to Pythia suite ($b = 0.092$). Cross-architecture models cluster near the Pythia trend.

## 7 Experiment 4: Causal Intervention

Experiments 1–3 establish correlational evidence. To complement this with causal evidence, we conduct controlled fine-tuning interventions. We note at the outset that several recent studies (notably Yao et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib71))’s controlled rearing and Misra and Mahowald ([2024](https://arxiv.org/html/2605.23039#bib.bib52)); Misra and Kim ([2024](https://arxiv.org/html/2605.23039#bib.bib70))’s controlled-input training) provide causal evidence of comparable or greater scope, since they manipulate the entire training trajectory rather than only a post-training adjustment; we therefore do not claim a unique “causal” contribution. Our intervention adds two diagnostic features absent from this prior work: replication across five random seeds, and a reverse-direction control addressing concerns about tautology in frequency-sensitive models (Mueller, [2024](https://arxiv.org/html/2605.23039#bib.bib75)).

### 7.1 Design

We select 20 dative verbs spanning the preemption continuum and construct three fine-tuning conditions for GPT-2 124M:

Amplified condition. For 10 target verbs with moderate preemption, we create fine-tuning data that *increases* the frequency of the conventional (prepositional dative) form by a factor of 3.

Attenuated condition. For the same 10 verbs, we *equalize* the frequency of both dative forms.

Reverse-direction condition. For the same 10 verbs, we *increase the unconventional* (double-object) form by a factor of 3, the opposite of what preemption theory identifies as the relevant manipulation. If the causal result were merely “frequency changes produce behavior changes” (the tautology concern), this condition should produce the mirror image of the Amplified condition. However, preemption theory predicts an asymmetry: increasing the competing *conventional* form should strengthen preemption more than increasing the unconventional form weakens it, because preemption operates through the inhibitory force of established alternatives.

A set of 10 control verbs receives balanced data. Each condition comprises 5,000 sentences, generated from templates validated for naturalness (mean perplexity under GPT-2 Medium: 22.1; Appendix [H](https://arxiv.org/html/2605.23039#A8)). We fine-tune for 3 epochs with learning rate $5 \times 10^{-5}$, replicated across 5 random seeds.
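
To make the three manipulations concrete, the sketch below applies them to per-verb counts of the two dative forms and reports the resulting share of the conventional form. Everything here is illustrative: the baseline counts are made up, `FormCounts` and `conventional_share` are hypothetical names, the share is only a stand-in for a preemption ratio (it is not the paper's Eq. 3), and implementing "equalize" as an even split of the total is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormCounts:
    """Fine-tuning sentence counts for one verb (hypothetical container)."""
    conventional: int    # prepositional dative: "donated the books to the library"
    unconventional: int  # double object: "donated the library the books"

    def conventional_share(self) -> float:
        # Illustrative stand-in for a preemption ratio; NOT the paper's Eq. 3.
        return self.conventional / (self.conventional + self.unconventional)


def amplified(c: FormCounts) -> FormCounts:
    # Triple the conventional form, leave the unconventional form unchanged.
    return FormCounts(3 * c.conventional, c.unconventional)


def attenuated(c: FormCounts) -> FormCounts:
    # Equalize the two forms; splitting the total evenly is an assumption.
    total = c.conventional + c.unconventional
    return FormCounts(total // 2, total - total // 2)


def reverse(c: FormCounts) -> FormCounts:
    # Triple the unconventional form, leave the conventional form unchanged.
    return FormCounts(c.conventional, 3 * c.unconventional)


baseline = FormCounts(conventional=800, unconventional=200)  # made-up counts
shares = {
    "baseline": baseline.conventional_share(),
    "amplified": amplified(baseline).conventional_share(),
    "attenuated": attenuated(baseline).conventional_share(),
    "reverse": reverse(baseline).conventional_share(),
}
for name, share in shares.items():
    print(f"{name:>10}: conventional share = {share:.3f}")
```

The point of the sketch is only that the same factor-of-3 change moves a ratio-style measure by different amounts depending on which form is multiplied and on the starting imbalance, so raw frequency change and ratio change are distinct quantities.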

### 7.2 Results

Table 6: Causal intervention results (GPT-2 124M, 5 seeds). $\Delta\Delta S$ = post minus pre. The Amplified effect (+0.73) is significantly larger than the Reverse effect (−0.29); $p < .001$.

Table [6](https://arxiv.org/html/2605.23039#S7.T6) confirms all predictions. The Amplified condition increases $\Delta S$ by 0.73 ± 0.07 bits (all 5 seeds positive, range [+0.66, +0.84]; $t(9) = 5.21$, $p < .001$, $d = 1.65$). The Attenuated condition decreases $\Delta S$ by 0.43 ± 0.05 bits ($d = 1.23$). Control verbs show no change. The Reverse condition, importantly, produces a smaller effect (−0.29) than the Attenuated condition (−0.43), and the Amplified effect (+0.73) is significantly larger in magnitude than the Reverse effect (−0.29; $t(18) = 4.12$, $p < .001$). This asymmetry, where increasing the conventional form strengthens preemption more than increasing the unconventional form weakens it, is predicted by preemption theory but not by a simple frequency-sensitivity account.

Non-target verbs showed no systematic change ($\Delta\Delta S = +0.02 \pm 0.05$, $p = .71$), confirming verb-specificity.
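
The effect size reported for the Amplified condition can be cross-checked against its $t$ statistic. Assuming $t(9)$ comes from a one-sample (pre vs. post) test over the $n = 10$ target verbs, which this section does not state explicitly, Cohen's $d$ for such a design is the $t$ statistic divided by $\sqrt{n}$:

$$d = \frac{t}{\sqrt{n}} = \frac{5.21}{\sqrt{10}} \approx 1.65$$

This agrees with the reported $d = 1.65$.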

[Figure 4 plot: bar chart; y-axis $\Delta\Delta S$ (bits/word), −0.5 to 1; bars: Amplified +0.73, Attenuated −0.43, Reverse −0.29, Control +0.03.]

Figure 4: Causal intervention effects across 5 random seeds. Error bars show ±1 SD. The Amplified–Reverse asymmetry ($p < .001$) rules out simple frequency-sensitivity as the explanation.

#### Addressing the tautology and asymmetry-confound concerns.

Three features counter the objection that Experiment 4 merely shows “changing frequencies in a frequency-sensitive model produces frequency-dependent behavior.” First, the Reverse condition shows the result is asymmetric: increasing the conventional form has a larger effect than increasing the unconventional form. Second, the effect is verb-specific: non-target verbs are unaffected. Third, $\Delta\Delta S$ correlates more strongly with the change in preemption *ratio* ($r = 0.84$) than with raw frequency change ($r = 0.41$). We nevertheless acknowledge two alternative interpretations of the Amplified–Reverse asymmetry that our design does not fully rule out: (a) the pre-training corpus is itself asymmetric (conventional forms dominate), so the Amplified condition reinforces an already-dominant pattern while the Reverse condition must overcome it, and some of the +0.73 vs. −0.29 gap may reflect this prior; (b) embedding-space neighborhood effects, in which manipulating one verb could propagate through clusters of semantically similar verbs (Li et al., [2022](https://arxiv.org/html/2605.23039#bib.bib3)). Our verb-specificity check partly addresses (b) but cannot exclude subtler representational effects. Both are discussed in the Limitations and point to mechanistic interpretability as the natural next step.

## 8 Implications for Linguistic Theory

### 8.1 Preemption as Distributional Learning

Our central finding, that LLMs trained on English text capture the distributional signature of statistical preemption, and that manipulating the relevant distributional variable in fine-tuning data shifts behavior in the predicted direction, supports [Goldberg](https://arxiv.org/html/2605.23039#bib.bib22)’s ([2018](https://arxiv.org/html/2605.23039#bib.bib22)) claim that preemption is learnable from positive evidence alone, without innate semantic verb-class constraints (Pinker, [1989](https://arxiv.org/html/2605.23039#bib.bib66)). The non-circular partial correlations (§[5.4](https://arxiv.org/html/2605.23039#S5.SS4)) demonstrate that the same distributional variable predicts both LLM and human behavior. This complements Bayesian accounts (Perfors et al., [2011](https://arxiv.org/html/2605.23039#bib.bib37)) and [Yang](https://arxiv.org/html/2605.23039#bib.bib34)’s ([2015](https://arxiv.org/html/2605.23039#bib.bib34)) Tolerance Principle (Yang, [2016](https://arxiv.org/html/2605.23039#bib.bib35)) by showing that multiple formalizations of how negative knowledge arises from distributional learning converge on similar predictions (Warstadt and Bowman, [2022](https://arxiv.org/html/2605.23039#bib.bib11); Warstadt et al., [2023](https://arxiv.org/html/2605.23039#bib.bib72)).

### 8.2 The Formal–Functional Divide

Mahowald et al. ([2023](https://arxiv.org/html/2605.23039#bib.bib50)) argued that LLMs excel at formal linguistic competence while struggling with functional competence. Our operationalization (Eq. [3](https://arxiv.org/html/2605.23039#S3.E3)) is a distributional proxy that does not directly capture Goldberg’s theoretical concept of functional equivalence; high preemption scores could arise from pragmatic constraints, register effects, or structured regularities that a formal account could equally describe. As partial mitigation, we identified 8 verbs where the frequency asymmetry plausibly reflects register preferences (e.g., *telegraph*, *cable*) and excluded them; the preemption effect strengthened ($d = 2.08$ vs. $d = 1.91$), suggesting the proxy captures functional competition in most cases. We do not, however, read our results as adjudicating between usage-based and formal accounts: that LLMs trained on distributional input reproduce the empirical signature of preemption is compatible with both a usage-based reading (distributional learning over constructional alternatives directly drives the effect) and a structured-regularities reading (the model internalizes abstract verb-class generalizations correlated with preemption strength). Our +Competing/–Competing dissociation (§[5.1](https://arxiv.org/html/2605.23039#S5.SS1)) constrains the latter but does not foreclose it; mechanistic interpretability (Geva et al., [2023](https://arxiv.org/html/2605.23039#bib.bib54); Conmy et al., [2023](https://arxiv.org/html/2605.23039#bib.bib18)) is the natural next probe.

### 8.3 Cross-Linguistic Predictions

Our study tests preemption only in English, a significant limitation. Preemption predictions differ across typologically diverse languages: in agglutinative languages like Turkish, preemption may operate over morphological alternations; in isolating languages like Mandarin, different construction types would be needed (Ambridge, [2020](https://arxiv.org/html/2605.23039#bib.bib27)). Resources such as WALS (Dryer and Haspelmath, [2013](https://arxiv.org/html/2605.23039#bib.bib74)), Grambank (Skirgård et al., [2023](https://arxiv.org/html/2605.23039#bib.bib55)), and Wilcox et al. ([2023](https://arxiv.org/html/2605.23039#bib.bib2))’s 11-language surprisal dataset provide infrastructure for cross-linguistic testing, which we regard as the most critical next step.

### 8.4 Relationship to Semantic Verb-Class Accounts

Pinker ([1989](https://arxiv.org/html/2605.23039#bib.bib66)) proposed that dative restrictions arise from innate semantic constraints (Narrow Range Rules). A Pinkerite interpretation would hold that LLMs capture preemption because distributional patterns correlate with semantic verb classes (Hovav and Levin, [2008](https://arxiv.org/html/2605.23039#bib.bib41)). Our +Competing/–Competing dissociation constrains this: frequency-matched verbs from similar semantic classes show different $\Delta S$ depending on whether a competing form exists. We cannot fully rule out that implicit semantic learning underlies both patterns; distinguishing the two requires testing verbs from the same narrow class with differing preemption strength.

## 9 Discussion

Our four experiments converge on a single finding: LLMs trained on English text reproduce the distributional signature of statistical preemption, causally modulated by the frequency of conventional competitors. Non-circular corpus-to-human partial correlations (§[5.4](https://arxiv.org/html/2605.23039#S5.SS4)) confirm that LLMs learn verb restrictions *specifically* where conventional alternatives are frequent, addressing the residual circularity that has dogged prior LLM probing work (Ettinger, [2020](https://arxiv.org/html/2605.23039#bib.bib73); Linzen and Baroni, [2020](https://arxiv.org/html/2605.23039#bib.bib12)). The Amplified–Reverse asymmetry from Experiment 4 (+0.73 vs. −0.29), with a reverse-direction control ruling out frequency confounds, isolates the inhibitory force of *conventional* forms that preemption theory uniquely predicts (Goldberg, [2018](https://arxiv.org/html/2605.23039#bib.bib22)). Residual errors on low-frequency verbs (*cable*, *telegraph*: $\Delta S > 1.5$ despite DAIS near 0.50) trace to register effects rather than preemption failure (§[8.2](https://arxiv.org/html/2605.23039#S8.SS2)).

BLiMP (Warstadt et al., [2020](https://arxiv.org/html/2605.23039#bib.bib10)) asks *whether* language models register a form as unacceptable; we ask *why*, and find the answer in competition rather than exposure alone. This places our work within a broader program of treating LLMs as scientific instruments (McCoy et al., [2024](https://arxiv.org/html/2605.23039#bib.bib68); Kallini et al., [2024](https://arxiv.org/html/2605.23039#bib.bib17); Baroni, [2022](https://arxiv.org/html/2605.23039#bib.bib51); Hu et al., [2020](https://arxiv.org/html/2605.23039#bib.bib15); Linzen et al., [2016](https://arxiv.org/html/2605.23039#bib.bib13); Gulordava et al., [2018](https://arxiv.org/html/2605.23039#bib.bib16); Marvin and Linzen, [2018](https://arxiv.org/html/2605.23039#bib.bib14); Warstadt et al., [2019](https://arxiv.org/html/2605.23039#bib.bib9)) and positions preemption sensitivity as a natural evaluation target within the BabyLM Challenge (Warstadt et al., [2023](https://arxiv.org/html/2605.23039#bib.bib72)) for developmentally plausible language models (Tachihara and Goldberg, [2025](https://arxiv.org/html/2605.23039#bib.bib61); Samara et al., [2025](https://arxiv.org/html/2605.23039#bib.bib59)).

#### Conclusion\.

Baker’s Paradox now has one answer for English: neural language models develop the same preemption sensitivity that shapes human acceptability, causally modulated by the relevant distributional variable; whether this generalizes typologically remains the central open question\.

#### Reproducibility\.

## Acknowledgments

We thank the three anonymous reviewers and the area chair of CoNLL 2026 for their constructive feedback, which substantially strengthened the paper. We are particularly grateful for the suggestion to expand our discussion of corpus-parsing methodology, to acknowledge the broader set of causal interventions in language-model psycholinguistics, and to incorporate Misra and Kim ([2024](https://arxiv.org/html/2605.23039#bib.bib70)) more centrally in our framing of distributional learning. We also thank the Construction Grammar and computational psycholinguistics communities whose decades of empirical and theoretical work made this study possible.

## Limitations

Several limitations bear on the interpretation of our findings\.

English-only scope. All claims are restricted to English and three construction types (dative, causative, locative); cross-linguistic generalization, which is essential for any universal claim about preemption, remains untested (Ambridge, [2020](https://arxiv.org/html/2605.23039#bib.bib27); Barak and Goldberg, [2017](https://arxiv.org/html/2605.23039#bib.bib19)). Appendix [N](https://arxiv.org/html/2605.23039#A14) sketches specific, falsifiable predictions for typologically diverse languages, but the present paper does not test them.

Locative human-data asymmetry. Of the three construction types we study, only the dative (DAIS, T&G) and causative (R&G) alternations have large, item-level human acceptability datasets available. The locative results are therefore evaluated only against the corpus-based preemption classification and the effect-size predictions derived from the human literature (Ambridge et al., [2008](https://arxiv.org/html/2605.23039#bib.bib64)); they are not validated against item-level human judgments. Collecting matched locative behavioral data is an important next step.

Distributional proxy for functional competition. Our corpus-based preemption scores (Eq. [3](https://arxiv.org/html/2605.23039#S3.E3)) are distributional proxies for functional competition (Goldberg, [2018](https://arxiv.org/html/2605.23039#bib.bib22)). The register-exclusion analysis (§[8.2](https://arxiv.org/html/2605.23039#S8.SS2)) and the +Competing/–Competing dissociation (§[5.1](https://arxiv.org/html/2605.23039#S5.SS1)) mitigate but do not eliminate the gap between corpus-distributional asymmetry and theoretically defined functional equivalence.

Fine-tuning does not reconstruct developmental learning. Experiment 4 manipulates a fine-tuning step applied to an already-trained model. This is causal evidence that LLM constructional preferences are *continuously sensitive* to relative competing-form frequency, but it does not recreate the developmental trajectory by which preemption preferences are originally acquired. Controlled-rearing designs that manipulate training composition from initialization, such as Yao et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib71))’s and Misra and Mahowald ([2024](https://arxiv.org/html/2605.23039#bib.bib52)); Misra and Kim ([2024](https://arxiv.org/html/2605.23039#bib.bib70))’s, are the stronger test of the developmental claim, and we view our results as converging with rather than displacing that line of work.

Alternative interpretations of the Reverse asymmetry. The Amplified–Reverse asymmetry ($+0.73$ vs. $-0.29$) is consistent with preemption theory’s prediction that conventional forms exert inhibitory force on alternatives. However, we cannot fully rule out two confounding accounts: (a) the pre-training corpus’s pre-existing imbalance favoring conventional over unconventional forms means fine-tuning with additional conventional-form data reinforces an already-dominant pattern, whereas the Reverse condition must work against that pre-existing mass; and (b) embedding-space neighborhood effects, by which manipulating a target verb’s distribution may propagate through clusters of semantically similar verbs in the model’s representations (Li et al., [2022](https://arxiv.org/html/2605.23039#bib.bib3)). Our verb-specificity check (non-target verbs unchanged, $\Delta\Delta S = +0.02 \pm 0.05$) is partially reassuring against (b), but disentangling these alternatives from preemption per se would require mechanistic interventions (e.g., causal mediation analysis on intermediate representations) we leave for future work.

Scale of the causal intervention. The causal intervention uses GPT-2 124M with 20 verbs; replication with larger models and broader verb samples would strengthen these claims.

Reliance on existing human datasets. We rely on existing human datasets rather than collecting judgments matched to our specific stimuli, and test only base models; instruction-tuned models may exhibit different preemption behavior.

No mechanistic probing. Finally, we do not probe the internal mechanism by which preemption is implemented in model representations (Geva et al., [2023](https://arxiv.org/html/2605.23039#bib.bib54); Conmy et al., [2023](https://arxiv.org/html/2605.23039#bib.bib18); Linzen and Baroni, [2020](https://arxiv.org/html/2605.23039#bib.bib12)); the formal-vs.-functional reading of our results (§[8.2](https://arxiv.org/html/2605.23039#S8.SS2)) cannot be fully adjudicated without such probing.

## References

- B. Ambridge, L. Barak, E. Wonnacott, C. Bannard, and G. Sala (2018) Effects of both preemption and entrenchment in the retreat from verb overgeneralization errors: Four reanalyses, an extended replication, and a meta-analytic synthesis. Collabra: Psychology 4 (1), Article 23. External Links: [Document](https://dx.doi.org/10.1525/collabra.133) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1), [§3.4](https://arxiv.org/html/2605.23039#S3.SS4.p3.1).
- B. Ambridge, A. Bidgood, K. E. Twomey, J. M. Pine, C. F. Rowland, and D. Freudenthal (2015) Preemption versus entrenchment: towards a construction-general solution to the problem of the retreat from verb argument structure overgeneralization. PLoS ONE 10 (4), pp. e0123723. External Links: [Document](https://dx.doi.org/10.1371/journal.pone.0123723) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1), [§3.4](https://arxiv.org/html/2605.23039#S3.SS4.p3.1).
- B. Ambridge, J. M. Pine, C. F. Rowland, and C. R. Young (2008) The effect of verb semantic class and verb frequency (entrenchment) on children’s and adults’ graded judgements of argument-structure overgeneralization errors. Cognition 106 (1), pp. 87–129. External Links: ISSN 0010-0277, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cognition.2006.12.015), [Link](https://www.sciencedirect.com/science/article/pii/S0010027707000029) Cited by: [§M.3](https://arxiv.org/html/2605.23039#A13.SS3.p1.1), [§3.1](https://arxiv.org/html/2605.23039#S3.SS1.p3.1), [§4.4](https://arxiv.org/html/2605.23039#S4.SS4.p1.11), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p3.1), [footnote 1](https://arxiv.org/html/2605.23039#footnote1).
- B. Ambridge (2020) Against stored abstractions: A radical exemplar model of language acquisition. First Language 40 (5-6), pp. 509–559. External Links: [Document](https://dx.doi.org/10.1177/0142723719869731) Cited by: [1st item](https://arxiv.org/html/2605.23039#A14.I1.i1.p1.1), [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§3.4](https://arxiv.org/html/2605.23039#S3.SS4.p3.1), [§8.3](https://arxiv.org/html/2605.23039#S8.SS3.p1.1), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p2.1).
- C. L. Baker (1979) Syntactic theory and the projection problem. Linguistic Inquiry 10 (4), pp. 533–581. External Links: [Link](https://www.jstor.org/stable/4178133) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p1.1).
- L. Barak and A. E. Goldberg (2017) Modeling the partial productivity of constructions. In 2017 AAAI Spring Symposia, Stanford University, Palo Alto, California, USA, March 27-29, 2017, External Links: [Link](http://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15297) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.5), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p2.1).
- M. Baroni (2022) On the proper role of linguistically oriented deep net analysis in linguistic theorising. In Algebraic Structures in Natural Language, pp. 1–16. External Links: ISBN 9781003205388, [Link](http://dx.doi.org/10.1201/9781003205388-1), [Document](https://dx.doi.org/10.1201/9781003205388-1) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p4.3), [§9](https://arxiv.org/html/2605.23039#S9.p2.1).
- Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1), pp. 289–300. External Links: ISSN 0035-9246, [Document](https://dx.doi.org/10.1111/j.2517-6161.1995.tb02031.x), [Link](https://doi.org/10.1111/j.2517-6161.1995.tb02031.x) Cited by: [§3.5](https://arxiv.org/html/2605.23039#S3.SS5.p1.6).
- J. K. Boyd, E. A. Gottschalk, and A. E. Goldberg (2009) Linking rule acquisition in novel phrasal constructions. Language Learning 59 (s1), pp. 64–89. External Links: ISSN 1467-9922, [Link](http://dx.doi.org/10.1111/j.1467-9922.2009.00536.x), [Document](https://dx.doi.org/10.1111/j.1467-9922.2009.00536.x) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1).
- J. Bresnan, A. Cueni, T. Nikitina, and R. H. Baayen (2007) Predicting the dative alternation. Cognitive Foundations of Interpretation. Cited by: [§2.2](https://arxiv.org/html/2605.23039#S2.SS2.p1.1).
- J. Bresnan and M. Ford (2010) Predicting syntax: processing dative constructions in American and Australian varieties of English. Language 86 (1), pp. 168–213. External Links: ISSN 1535-0665, [Link](http://dx.doi.org/10.1353/lan.0.0189), [Document](https://dx.doi.org/10.1353/lan.0.0189) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.5).
- P. J. Brooks and M. Tomasello (1999) How children constrain their argument structure constructions. Language 75 (4), pp. 720. External Links: ISSN 0097-8507, [Link](http://dx.doi.org/10.2307/417731), [Document](https://dx.doi.org/10.2307/417731) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§3.4](https://arxiv.org/html/2605.23039#S3.SS4.p3.1).
- J. Bybee (2010) Language, usage and cognition. Cambridge University Press. External Links: ISBN 9780511750526, [Link](http://dx.doi.org/10.1017/cbo9780511750526), [Document](https://dx.doi.org/10.1017/cbo9780511750526) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.4).
- A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), External Links: [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Abstract-Conference.html) Cited by: [§8.2](https://arxiv.org/html/2605.23039#S8.SS2.p1.3), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p9.1).
- S. Dey and W. G. Sakas (2025) Performance and competence intertwined: a computational model of the null subject stage in English-speaking children. In Proceedings of the Second International Workshop on Construction Grammars and NLP (CxGsNLP 2025), C. Bonial, M. Torgbi, L. Weissweiler, A. Blodgett, K. Beuls, P. Van Eecke, and H. Tayyar Madabushi (Eds.), Düsseldorf, Germany, pp. 96–108. External Links: ISBN 979-8-89176-318-0, [Link](https://aclanthology.org/2025.cxgsnlp-1.10/) Cited by: [Appendix O](https://arxiv.org/html/2605.23039#A15.p1.1).
- H. Diessel (2019) The grammar network: how linguistic structure is shaped by language use. Cambridge University Press. External Links: ISBN 9781108712767, [Link](http://dx.doi.org/10.1017/9781108671040), [Document](https://dx.doi.org/10.1017/9781108671040) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.4).
- M. S. Dryer and M. Haspelmath (2013) The World Atlas of Language Structures Online (v2020.4) [data set]. Zenodo. External Links: [Document](https://dx.doi.org/10.5281/zenodo.13950591), [Link](https://wals.info/) Cited by: [3rd item](https://arxiv.org/html/2605.23039#A14.I3.i3.p1.1), [§8.3](https://arxiv.org/html/2605.23039#S8.SS3.p1.1).
- N. C. Ellis (2002) Frequency effects in language processing: a review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition 24 (2), pp. 143–188. External Links: ISSN 1470-1545, [Link](http://dx.doi.org/10.1017/s0272263102002024), [Document](https://dx.doi.org/10.1017/s0272263102002024) Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1).
- A. Ettinger (2020) What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans. Assoc. Comput. Linguistics 8, pp. 34–48. External Links: [Link](https://doi.org/10.1162/tacl%5C_a%5C_00298), [Document](https://dx.doi.org/10.1162/TACL%5FA%5F00298) Cited by: [§9](https://arxiv.org/html/2605.23039#S9.p1.3).
- R. Futrell, E. Wilcox, T. Morita, P. Qian, M. Ballesteros, and R. Levy (2019) Neural language models as psycholinguistic subjects: representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 32–42. External Links: [Link](https://doi.org/10.18653/v1/n19-1004), [Document](https://dx.doi.org/10.18653/V1/N19-1004) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p4.3).
- M. Geva, J. Bastings, K. Filippova, and A. Globerson (2023) Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali (Eds.), pp. 12216–12235. External Links: [Link](https://doi.org/10.18653/v1/2023.emnlp-main.751), [Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.751) Cited by: [§8.2](https://arxiv.org/html/2605.23039#S8.SS2.p1.3), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p9.1).
- A. E. Goldberg (1995) Constructions: A construction grammar approach to argument structure. Cognitive Theory of Language and Culture Series, University of Chicago Press, Chicago. External Links: ISBN 978-0226300863 Cited by: [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.4).
- A. E. Goldberg (2011) Corpus evidence of the viability of statistical preemption. Cognitive Linguistics: The Quantitative Turn 22 (1), pp. 131–153. External Links: [Link](http://dx.doi.org/10.1515/9783110335255.57), [Document](https://dx.doi.org/10.1515/9783110335255.57) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.4), [§3.4](https://arxiv.org/html/2605.23039#S3.SS4.p2.5).
- A. E. Goldberg (2016) Partial productivity of linguistic constructions: dynamic categorization and statistical preemption. Language and Cognition 8 (3), pp. 369–390. External Links: ISSN 1866-9859, [Link](http://dx.doi.org/10.1017/langcog.2016.17), [Document](https://dx.doi.org/10.1017/langcog.2016.17) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.5).
- A. Goldberg (2005) Constructions at work: the nature of generalization in language. Oxford University Press, Oxford. External Links: ISBN 9780191708428, [Link](http://dx.doi.org/10.1093/acprof:oso/9780199268511.001.0001), [Document](https://dx.doi.org/10.1093/acprof%3Aoso/9780199268511.001.0001) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.4).
- A. Goldberg (2018) Explain me this: creativity, competition, and the partial productivity of constructions. Princeton University Press. External Links: ISBN 9780691183954, [Link](http://dx.doi.org/10.1515/9780691183954), [Document](https://dx.doi.org/10.1515/9780691183954) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p2.1), [§2.1](https://arxiv.org/html/2605.23039#S2.SS1.p1.5), [§8.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1), [§9](https://arxiv.org/html/2605.23039#S9.p1.3), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p4.1).
- A. Goodkind and K. Bicknell (2018) Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics, CMCL 2018, Salt Lake City, Utah, USA, January 7, 2018, A. B. Sayeed, C. Jacobs, T. Linzen, and M. van Schijndel (Eds.), pp. 10–18. External Links: [Link](https://doi.org/10.18653/v1/w18-0102), [Document](https://dx.doi.org/10.18653/V1/W18-0102) Cited by: [§2.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1).
- J. Gropen, S. Pinker, M. Hollander, R. Goldberg, and R. Wilson (1989) The learnability and acquisition of the dative alternation in English. Language 65 (2), pp. 203. External Links: ISSN 0097-8507, [Link](http://dx.doi.org/10.2307/415332), [Document](https://dx.doi.org/10.2307/415332) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p1.1), [§2.2](https://arxiv.org/html/2605.23039#S2.SS2.p1.1).
- K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni (2018) Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 1195–1205. External Links: [Link](https://doi.org/10.18653/v1/n18-1108), [Document](https://dx.doi.org/10.18653/V1/N18-1108) Cited by: [§9](https://arxiv.org/html/2605.23039#S9.p2.1).
- J\. Hale \(2001\)A probabilistic earley parser as a psycholinguistic model\.InLanguage Technologies 2001: The Second Meeting of the North American Chapter of the Association for Computational Linguistics, NAACL 2001, Pittsburgh, PA, USA, June 2\-7, 2001,External Links:[Link](https://aclanthology.org/N01-1021/)Cited by:[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- R\. X\. D\. Hawkins, T\. Yamakoshi, T\. L\. Griffiths, and A\. E\. Goldberg \(2020\)Investigating representations of verb bias in neural language models\.Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020,pp\. 4653–4663\.External Links:[Link](https://doi.org/10.18653/v1/2020.emnlp-main.376),[Document](https://dx.doi.org/10.18653/V1/2020.EMNLP-MAIN.376)Cited by:[§2\.2](https://arxiv.org/html/2605.23039#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.23039#S3.SS1.p2.1),[§3\.2](https://arxiv.org/html/2605.23039#S3.SS2.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, O\. Vinyals, J\. W\. Rae, and L\. Sifre \(2022\)Training compute\-optimal large language models\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§6](https://arxiv.org/html/2605.23039#S6.p1.2)\.
- M\. R\. Hovav and B\. Levin \(2008\)The english dative alternation: the case for verb sensitivity\.Journal of Linguistics44\(1\),pp\. 129–167\.External Links:ISSN 1469\-7742,[Link](http://dx.doi.org/10.1017/s0022226707004975),[Document](https://dx.doi.org/10.1017/s0022226707004975)Cited by:[§2\.2](https://arxiv.org/html/2605.23039#S2.SS2.p1.1),[§8\.4](https://arxiv.org/html/2605.23039#S8.SS4.p1.2)\.
- K\. Howitt, S\. Dey, and W\. G\. Sakas \(2021\)Gradual syntactic triggering: the gradient parameter hypothesis\.Language Acquisition28\(1\),pp\. 65–96\.External Links:[Document](https://dx.doi.org/10.1080/10489223.2020.1803329)Cited by:[Appendix O](https://arxiv.org/html/2605.23039#A15.p1.1)\.
- J\. Hu, J\. Gauthier, P\. Qian, E\. Wilcox, and R\. Levy \(2020\)A systematic assessment of syntactic generalization in neural language models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5\-10, 2020,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. R\. Tetreault \(Eds\.\),pp\. 1725–1744\.External Links:[Link](https://doi.org/10.18653/v1/2020.acl-main.158),[Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.158)Cited by:[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- J\. Hu, K\. Mahowald, G\. Lupyan, A\. Ivanova, and R\. Levy \(2024\)Language models align with human judgments on key grammatical constructions\.Proceedings of the National Academy of Sciences121\(36\),pp\. e2400917121\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2400917121),[Link](https://www.pnas.org/doi/arXiv.10.1073/pnas.2400917121)Cited by:[Appendix K](https://arxiv.org/html/2605.23039#A11.p1.1),[§1](https://arxiv.org/html/2605.23039#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- J\. Kallini, I\. Papadimitriou, R\. Futrell, K\. Mahowald, and C\. Potts \(2024\)Mission: impossible language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 14691–14714\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.787),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.787)Cited by:[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\) Scaling laws for neural language models\. arXiv preprint arXiv:2001\.08361\. External Links: [Link](https://arxiv.org/arXiv.2001.08361) Cited by: [§6](https://arxiv.org/html/2605.23039#S6.p1.2)\.
- D\. Lakens \(2013\)Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t\-tests and anovas\.Frontiers in Psychology4\.External Links:ISSN 1664\-1078,[Link](http://dx.doi.org/10.3389/fpsyg.2013.00863),[Document](https://dx.doi.org/10.3389/fpsyg.2013.00863)Cited by:[§3\.5](https://arxiv.org/html/2605.23039#S3.SS5.p1.6)\.
- J\. H\. Lau, A\. Clark, and S\. Lappin \(2017\)Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge\.Cogn\. Sci\.41\(5\),pp\. 1202–1241\.External Links:[Link](https://doi.org/10.1111/cogs.12414),[Document](https://dx.doi.org/10.1111/COGS.12414)Cited by:[§D\.2](https://arxiv.org/html/2605.23039#A4.SS2.p1.2)\.
- B\. Levin \(1993\) English verb classes and alternations: a preliminary investigation\. University of Chicago Press\. Cited by: [Appendix E](https://arxiv.org/html/2605.23039#A5.p1.2), [§2\.2](https://arxiv.org/html/2605.23039#S2.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.23039#S3.SS1.p2.1)\.
- R\. Levy \(2008\)Expectation\-based syntactic comprehension\.Cognition106\(3\),pp\. 1126–1177\.External Links:[Document](https://dx.doi.org/10.1016/j.cognition.2007.05.006)Cited by:[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- B\. Li, Z\. Zhu, G\. Thomas, F\. Rudzicz, and Y\. Xu \(2022\)Neural reality of argument structure constructions\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2022, Dublin, Ireland, May 22\-27, 2022,S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),pp\. 7410–7423\.External Links:[Link](https://doi.org/10.18653/v1/2022.acl-long.512),[Document](https://dx.doi.org/10.18653/V1/2022.ACL-LONG.512)Cited by:[§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1),[§7\.2](https://arxiv.org/html/2605.23039#S7.SS2.SSS0.Px1.p1.4),[Limitations](https://arxiv.org/html/2605.23039#Sx2.p6.3)\.
- T\. Linzen and M\. Baroni \(2020\) Syntactic structure from deep learning\. arXiv preprint arXiv:2004\.10827\. External Links: [Link](https://arxiv.org/arXiv.2004.10827) Cited by: [§9](https://arxiv.org/html/2605.23039#S9.p1.3), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p9.1)\.
- T\. Linzen, E\. Dupoux, and Y\. Goldberg \(2016\)Assessing the ability of lstms to learn syntax\-sensitive dependencies\.Trans\. Assoc\. Comput\. Linguistics4,pp\. 521–535\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00115),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00115)Cited by:[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- K\. Mahowald, A\. A\. Ivanova, I\. A\. Blank, N\. Kanwisher, J\. B\. Tenenbaum, and E\. Fedorenko \(2023\) Dissociating language and thought in large language models: a cognitive perspective\. arXiv preprint arXiv:2301\.06627\. External Links: [Link](https://doi.org/10.48550/arXiv.2301.06627), [Document](https://dx.doi.org/10.48550/ARXIV.2301.06627) Cited by: [§8\.2](https://arxiv.org/html/2605.23039#S8.SS2.p1.3)\.
- R\. Marvin and T\. Linzen \(2018\)Targeted syntactic evaluation of language models\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),pp\. 1192–1202\.External Links:[Link](https://doi.org/10.18653/v1/d18-1151),[Document](https://dx.doi.org/10.18653/V1/D18-1151)Cited by:[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- R\. T\. McCoy, S\. Yao, D\. Friedman, M\. D\. Hardy, and T\. L\. Griffiths \(2024\)Embers of autoregression show how large language models are shaped by the problem they are trained to solve\.Proceedings of the National Academy of Sciences121\(41\),pp\. e2322420121\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2322420121),[Link](https://www.pnas.org/doi/arXiv.10.1073/pnas.2322420121)Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p4.3),[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- J\. A\. Michaelov, M\. D\. Bardolph, C\. K\. Van Petten, B\. K\. Bergen, and S\. Coulson \(2024\)Strong prediction: language model surprisal explains multiple n400 effects\.Neurobiology of Language5\(1\),pp\. 107–135\.External Links:ISSN 2641\-4368,[Link](http://dx.doi.org/10.1162/nol_a_00105),[Document](https://dx.doi.org/10.1162/nol%5Fa%5F00105)Cited by:[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- K\. Misra and N\. Kim \(2024\) Generating novel experimental hypotheses from language models: A case study on cross\-dative generalization\. arXiv preprint arXiv:2408\.05086\. External Links: [Link](https://doi.org/10.48550/arXiv.2408.05086), [Document](https://dx.doi.org/10.48550/ARXIV.2408.05086) Cited by: [§1](https://arxiv.org/html/2605.23039#S1.p4.3), [§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1), [Table 1](https://arxiv.org/html/2605.23039#S2.T1), [§7](https://arxiv.org/html/2605.23039#S7.p1.1), [Acknowledgments](https://arxiv.org/html/2605.23039#Sx1.p1.1), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p5.1)\.
- K\. Misra and K\. Mahowald \(2024\)Language models learn rare phenomena from less rare phenomena: the case of the missing aanns\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 913–929\.External Links:[Link](https://doi.org/10.18653/v1/2024.emnlp-main.53),[Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.53)Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p3.1),[§1](https://arxiv.org/html/2605.23039#S1.p4.3),[§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1),[Table 1](https://arxiv.org/html/2605.23039#S2.T1),[§7](https://arxiv.org/html/2605.23039#S7.p1.1),[Limitations](https://arxiv.org/html/2605.23039#Sx2.p5.1)\.
- A\. Mueller \(2024\) Missed causes and ambiguous effects: counterfactuals pose challenges for interpreting neural networks\. arXiv preprint arXiv:2407\.04690\. External Links: [Link](https://doi.org/10.48550/arXiv.2407.04690), [Document](https://dx.doi.org/10.48550/ARXIV.2407.04690) Cited by: [§7](https://arxiv.org/html/2605.23039#S7.p1.1)\.
- F\. Perek \(2014\)Vector spaces for historical linguistics: using distributional semantics to study syntactic productivity in diachrony\.InProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22\-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers,pp\. 309–314\.External Links:[Link](https://doi.org/10.3115/v1/p14-2051),[Document](https://dx.doi.org/10.3115/V1/P14-2051)Cited by:[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1)\.
- F\. Perek \(2015\)Argument structure in usage\-based construction grammar: experimental and corpus\-based perspectives\.John Benjamins Publishing Company\.External Links:ISBN 9789027268754,ISSN 1573\-594X,[Link](http://dx.doi.org/10.1075/cal.17),[Document](https://dx.doi.org/10.1075/cal.17)Cited by:[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1)\.
- A\. Perfors, J\. B\. Tenenbaum, and T\. Regier \(2011\)The learnability of abstract syntactic principles\.Cognition118\(3\),pp\. 306–338\.External Links:ISSN 0010\-0277,[Link](http://dx.doi.org/10.1016/j.cognition.2010.11.001),[Document](https://dx.doi.org/10.1016/j.cognition.2010.11.001)Cited by:[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1),[§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1)\.
- S\. Pinker \(1989\)Learnability and cognition: the acquisition of argument structure\.Learning, Development, and Conceptual Change,MIT Press,Cambridge, MA\.Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p1.1),[§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1),[§8\.4](https://arxiv.org/html/2605.23039#S8.SS4.p1.2)\.
- A\. Samara, E\. Wonnacott, G\. Saxena, R\. Maitreyee, J\. Fazekas, and B\. Ambridge \(2025\)Learners restrict their linguistic generalizations using preemption but not entrenchment: Evidence from artificial\-language\-learning studies with adults and children\.Psychological Review132\(1\),pp\. 1–17\.External Links:[Document](https://dx.doi.org/10.1037/rev0000463)Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1),[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- R\. Schaeffer, B\. Miranda, and S\. Koyejo \(2023\)Are emergent abilities of large language models a mirage?\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/adc98a266f45005c403b8311ca7e8bd7-Abstract-Conference.html)Cited by:[§6](https://arxiv.org/html/2605.23039#S6.p1.2)\.
- W\. Scivetti and N\. Schneider \(2025\)Construction identification and disambiguation using BERT: a case study of NPN\.InProceedings of the 29th Conference on Computational Natural Language Learning,G\. Boleda and M\. Roth \(Eds\.\),Vienna, Austria,pp\. 365–376\.External Links:[Link](https://aclanthology.org/2025.conll-1.24/),[Document](https://dx.doi.org/10.18653/v1/2025.conll-1.24),ISBN 979\-8\-89176\-271\-8Cited by:[§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1)\.
- C\. Shain, C\. Meister, T\. Pimentel, R\. Cotterell, and R\. Levy \(2024\)Large\-scale evidence for logarithmic effects of word predictability on reading time\.Proceedings of the National Academy of Sciences121\(10\),pp\. e2307876121\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2307876121)Cited by:[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- H\. Skirgård, H\. J\. Haynie, D\. E\. Blasi, H\. Hammarström, J\. Collins, J\. J\. Latarche, J\. Lesage, T\. Weber, A\. Witzlack\-Makarevich, S\. Passmore, A\. Chira, L\. Maurits, R\. Dinnage, M\. Dunn, G\. Reesink, R\. Singer, C\. Bowern, P\. Epps, J\. Hill, O\. Vesakoski, M\. Robbeets, N\. K\. Abbas, D\. Auer, N\. A\. Bakker, G\. Barbos, R\. D\. Borges, S\. Danielsen, L\. Dorenbusch, E\. Dorn, J\. Elliott, G\. Falcone, J\. Fischer, Y\. Ghanggo Ate, H\. Gibson, H\. Göbel, J\. A\. Goodall, V\. Gruner, A\. Harvey, R\. Hayes, L\. Heer, R\. E\. Herrera Miranda, N\. Hübler, B\. Huntington\-Rainey, J\. K\. Ivani, M\. Johns, E\. Just, E\. Kashima, C\. Kipf, J\. V\. Klingenberg, N\. König, A\. Koti, R\. G\. A\. Kowalik, O\. Krasnoukhova, N\. L\. M\. Lindvall, M\. Lorenzen, H\. Lutzenberger, T\. R\. A\. Martins, C\. Mata German, S\. van der Meer, J\. Montoya Samamé, M\. Müller, S\. Muradoglu, K\. Neely, J\. Nickel, M\. Norvik, C\. A\. Oluoch, J\. Peacock, I\. O\. C\. Pearey, N\. Peck, S\. Petit, S\. Pieper, M\. Poblete, D\. Prestipino, L\. Raabe, A\. Raja, J\. Reimringer, S\. C\. Rey, J\. Rizaew, E\. Ruppert, K\. K\. Salmon, J\. Sammet, R\. Schembri, L\. Schlabbach, F\. W\. P\. Schmidt, A\. Skilton, W\. D\. Smith, H\. de Sousa, K\. Sverredal, D\. Valle, J\. Vera, J\. Voß, T\. Witte, H\. Wu, S\. Yam, J\. Ye, M\. Yong, T\. Yuditha, R\. Zariquiey, R\. Forkel, N\. Evans, S\. C\. Levinson, M\. Haspelmath, S\. J\. Greenhill, Q\. D\. Atkinson, and R\. D\. Gray \(2023\)Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss\.Science Advances9\(16\)\.External Links:ISSN 2375\-2548,[Link](http://dx.doi.org/10.1126/sciadv.adg6175),[Document](https://dx.doi.org/10.1126/sciadv.adg6175)Cited by:[3rd item](https://arxiv.org/html/2605.23039#A14.I3.i3.p1.1),[§8\.3](https://arxiv.org/html/2605.23039#S8.SS3.p1.1)\.
- N\. J\. Smith and R\. Levy \(2013\)The effect of word predictability on reading time is logarithmic\.Cognition128\(3\),pp\. 302–319\.External Links:ISSN 0010\-0277,[Link](http://dx.doi.org/10.1016/j.cognition.2013.02.013),[Document](https://dx.doi.org/10.1016/j.cognition.2013.02.013)Cited by:[§2\.3](https://arxiv.org/html/2605.23039#S2.SS3.p1.1)\.
- A\. Stefanowitsch \(2006\)Negative evidence and the raw frequency fallacy\.Corpus Linguistics and Linguistic Theory2\(1\),pp\. 61–77\.External Links:[Link](https://doi.org/10.1515/CLLT.2006.003),[Document](https://dx.doi.org/doi%3A10.1515/CLLT.2006.003)Cited by:[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p3.1)\.
- K\. Tachihara and A\. E\. Goldberg \(2020\)Reduced competition effects and noisier representations in a second language\.Language Learning70\(1\),pp\. 219–265\.External Links:[Document](https://dx.doi.org/10.1111/lang.12375)Cited by:[§3\.2](https://arxiv.org/html/2605.23039#S3.SS2.p1.1)\.
- K\. Tachihara and A\. E\. Goldberg \(2025\)Learning unacceptability: repeated exposure to acceptable sentences improves adult learners’ recognition of unacceptable sentences\.Language Learning75\(1\),pp\. 77–116\.External Links:[Document](https://dx.doi.org/10.1111/lang.12660)Cited by:[§O\.2](https://arxiv.org/html/2605.23039#A15.SS2.p1.1),[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1),[§3\.2](https://arxiv.org/html/2605.23039#S3.SS2.p1.1),[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- A\. Warstadt and S\. R\. Bowman \(2022\) What artificial neural networks can tell us about human language acquisition\. arXiv preprint arXiv:2208\.07998\. External Links: [Link](https://doi.org/10.48550/arXiv.2208.07998), [Document](https://dx.doi.org/10.48550/ARXIV.2208.07998) Cited by: [§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1)\.
- A\. Warstadt, A\. Mueller, L\. Choshen, E\. Wilcox, C\. Zhuang, J\. Ciro, R\. Mosquera, B\. Paranjabe, A\. Williams, T\. Linzen, and R\. Cotterell \(2023\)Findings of the BabyLM challenge: sample\-efficient pretraining on developmentally plausible corpora\.Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning,pp\. 1–34\.External Links:[Link](https://aclanthology.org/2023.conll-babylm.1/),[Document](https://dx.doi.org/10.18653/v1/2023.conll-babylm.1)Cited by:[Appendix O](https://arxiv.org/html/2605.23039#A15.p1.1),[§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1),[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- A\. Warstadt, A\. Parrish, H\. Liu, A\. Mohananey, W\. Peng, S\. Wang, and S\. R\. Bowman \(2020\)BLiMP: the benchmark of linguistic minimal pairs for english\.Trans\. Assoc\. Comput\. Linguistics8,pp\. 377–392\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00321),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00321)Cited by:[Appendix K](https://arxiv.org/html/2605.23039#A11.p1.1),[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- A\. Warstadt, A\. Singh, and S\. R\. Bowman \(2019\)Neural network acceptability judgments\.Trans\. Assoc\. Comput\. Linguistics7,pp\. 625–641\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00290),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00290)Cited by:[§9](https://arxiv.org/html/2605.23039#S9.p2.1)\.
- J\. Wei, Y\. Tay, R\. Bommasani, C\. Raffel, B\. Zoph, S\. Borgeaud, D\. Yogatama, M\. Bosma, D\. Zhou, D\. Metzler, E\. H\. Chi, T\. Hashimoto, O\. Vinyals, P\. Liang, J\. Dean, and W\. Fedus \(2022\)Emergent abilities of large language models\.Trans\. Mach\. Learn\. Res\.2022\.External Links:[Link](https://openreview.net/forum?id=yzkSU5zdwD)Cited by:[§6](https://arxiv.org/html/2605.23039#S6.p1.2)\.
- L\. Weissweiler, T\. He, N\. Otani, D\. R\. Mortensen, L\. S\. Levin, and H\. Schütze \(2023\) Construction grammar provides unique insight into neural language models\. arXiv preprint arXiv:2302\.02178\. External Links: [Link](https://doi.org/10.48550/arXiv.2302.02178), [Document](https://dx.doi.org/10.48550/ARXIV.2302.02178) Cited by: [§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1)\.
- E\. G\. Wilcox, T\. Pimentel, C\. Meister, R\. Cotterell, and R\. P\. Levy \(2023\)Testing the predictions of surprisal theory in 11 languages\.Trans\. Assoc\. Comput\. Linguistics11,pp\. 1451–1470\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00612),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00612)Cited by:[3rd item](https://arxiv.org/html/2605.23039#A14.I3.i3.p1.1),[§1](https://arxiv.org/html/2605.23039#S1.p4.3),[§8\.3](https://arxiv.org/html/2605.23039#S8.SS3.p1.1)\.
- I\. Winkler, M\. Glauer, T\. Betsch, and P\. Sedlmeier \(2015\)The impact of attention on judgments of frequency and duration\.PLoS ONE10\(5\),pp\. e0126974\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0126974)Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p2.1),[§1](https://arxiv.org/html/2605.23039#S1.p4.3),[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.23039#S3.SS1.p3.1),[§3\.2](https://arxiv.org/html/2605.23039#S3.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.23039#S5.SS1.p1.1)\.
- E\. Wonnacott, E\. L\. Newport, and M\. K\. Tanenhaus \(2008\)Acquiring and processing verb argument structure: distributional learning in a miniature language\.Cognitive Psychology56\(3\),pp\. 165–209\.External Links:ISSN 0010\-0285,[Link](http://dx.doi.org/10.1016/j.cogpsych.2007.04.002),[Document](https://dx.doi.org/10.1016/j.cogpsych.2007.04.002)Cited by:[§1](https://arxiv.org/html/2605.23039#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.23039#S2.SS1.p2.1)\.
- C\. Yang \(2015\)Negative knowledge from positive evidence\.Language91\(4\),pp\. 938–953\.External Links:ISSN 1535\-0665,[Link](http://dx.doi.org/10.1353/lan.2015.0054),[Document](https://dx.doi.org/10.1353/lan.2015.0054)Cited by:[§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1)\.
- C\. Yang \(2016\)The price of linguistic productivity: how children learn to break the rules of language\.The MIT Press\.External Links:ISBN 9780262336376,[Link](http://dx.doi.org/10.7551/mitpress/9780262035323.001.0001),[Document](https://dx.doi.org/10.7551/mitpress/9780262035323.001.0001)Cited by:[§8\.1](https://arxiv.org/html/2605.23039#S8.SS1.p1.1)\.
- Q\. Yao, K\. Misra, L\. Weissweiler, and K\. Mahowald \(2025\) Both direct and indirect evidence contribute to dative alternation preferences in language models\. arXiv preprint arXiv:2503\.20850\. External Links: [Link](https://doi.org/10.48550/arXiv.2503.20850), [Document](https://dx.doi.org/10.48550/ARXIV.2503.20850) Cited by: [Appendix Q](https://arxiv.org/html/2605.23039#A17.p1.1), [§1](https://arxiv.org/html/2605.23039#S1.p3.1), [§1](https://arxiv.org/html/2605.23039#S1.p4.3), [§2\.4](https://arxiv.org/html/2605.23039#S2.SS4.p1.1), [Table 1](https://arxiv.org/html/2605.23039#S2.T1), [§7](https://arxiv.org/html/2605.23039#S7.p1.1), [Limitations](https://arxiv.org/html/2605.23039#Sx2.p5.1)\.

## Appendix A Stimulus Materials

### A.1 Dative Verbs by Preemption Category

Strong preemption (27 verbs): *donate, explain, whisper, mutter, announce, confess, demonstrate, describe, dictate, illustrate, mention, murmur, narrate, portray, proclaim, propose, recite, recommend, recount, relay, report, return, say, shout, suggest, transfer, yell*.

Weak preemption (26 verbs): *carry, deliver, drive, ferry, fly, hand, haul, kick, lend, mail, move, pass, pull, push, read, rent, serve, ship, slide, take, throw, toss, wire, write, cable, telegraph*.

No preemption (27 verbs): *bring, feed, give, grant, leave, loan, offer, owe, pay, promise, sell, send, show, teach, tell, wish, award, deal, flip, forward, guarantee, pitch, quote, refund, repay, trade, float*.

### A.2 Causative and Locative Verbs

Causative (20 verbs): *Strong*: disappear, vanish, die, faint, blush, cry, laugh, sneeze, sleep, arrive. *None*: melt, bounce, open, close, break, grow, change, turn, roll, slide.

Locative (20 verbs): *Strong*: pour, drip, dump, dribble, drizzle, squeeze, scatter, sprinkle, splash, squirt. *None*: spray, load, pack, stuff, wrap, smear, spread, stock, cram, fill.

## Appendix B Sentence Template Controls

Each verb was embedded in 5 matched sentence contexts controlling for subject identity (five human agents), theme/recipient (matched for length, animacy, definiteness), tense (simple past throughout), and length (matched within $\pm 2$ words; mean: 8.4 words, SD: 1.2).

Example for *donate* (Strong, dative):

1. She donated the paintings to the museum. / \*She donated the museum the paintings.
2. The professor donated his collection to the university. / \*The professor donated the university his collection.
3. My neighbor donated her old clothes to the shelter. / \*My neighbor donated the shelter her old clothes.
4. The company donated computers to the school. / \*The company donated the school computers.
5. His family donated their savings to the foundation. / \*His family donated the foundation their savings.

Example for *give* (None, dative):

1. She gave the flowers to the teacher. / She gave the teacher the flowers.
2. The professor gave his notes to the student. / The professor gave the student his notes.
3. My neighbor gave her keys to the friend. / My neighbor gave the friend her keys.
4. The company gave a bonus to the employee. / The company gave the employee a bonus.
5. His family gave the money to the charity. / His family gave the charity the money.

## Appendix C Full Results for All Models

Table 7: Complete $\Delta S$ (bits/word) for all 14 models across three construction types. S = Strong, W = Weak, N = None preemption. The construction-level ordering (Dative > Causative > Locative) holds for all 14 models. The perfect monotonicity (S > W > N) across all 42 cells is an empirical result: no analytic guarantee ensures this ordering, and the permutation test ($p < .0001$) confirms it is highly unlikely to arise by chance.

## Appendix D Regression Diagnostics

The mixed-effects model from Experiment 2 includes random intercepts and random slopes for Preempt by model identity.

Table 8: Mixed-effects regression. Marginal $R^{2} = 0.68$; conditional $R^{2} = 0.74$. VIF $= 1.34$; Shapiro-Wilk $W = 0.987$, $p = .12$; no Cook’s $D > 0.5$.

### D.1 Robustness: Low-Collinearity Subset

Re-estimating with only verbs where $|r(\textsc{Preempt}, \textsc{Entrench})| < 0.3$ ($n = 52$), the preemption coefficient is stable ($\beta = 3.18$, $p < .001$) while entrenchment becomes marginal ($\beta = 0.14$, $p = .09$).

### D.2 Robustness: Alternative Surprisal Measures

Results hold under SLOR normalization (Lau et al., [2017](https://arxiv.org/html/2605.23039#bib.bib48)) ($r = 0.77$ with DAIS for LLaMA-2 7B) and critical-region surprisal ($r = 0.75$).
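For reference, SLOR (the syntactic log-odds ratio) subtracts a sentence's unigram log-probability from its model log-probability and divides by sentence length, which discounts lexical frequency and length. The following is a minimal sketch, assuming per-token log-probabilities have already been computed and aligned; the function name `slor` and the toy numbers are illustrative and are not values from our experiments.

```python
from typing import Sequence


def slor(model_logprobs: Sequence[float], unigram_logprobs: Sequence[float]) -> float:
    """SLOR = (log p_model(s) - log p_unigram(s)) / |s|.

    Both inputs are per-token log-probabilities for the same tokenization.
    """
    if len(model_logprobs) != len(unigram_logprobs):
        raise ValueError("token sequences must be aligned")
    if not model_logprobs:
        raise ValueError("empty sentence")
    sentence_lp = sum(model_logprobs)      # log p_model(s)
    unigram_lp = sum(unigram_logprobs)     # log p_unigram(s)
    return (sentence_lp - unigram_lp) / len(model_logprobs)


# Toy, hypothetical log-probabilities for a three-token sentence.
model_lp = [-2.0, -1.0, -3.0]
unigram_lp = [-5.0, -4.0, -6.0]
score = slor(model_lp, unigram_lp)
```

Higher SLOR indicates a sentence that the model finds more probable than its word frequencies alone would predict.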

### D.3 Additional Circularity Controls

We report three supplementary controls addressing corpus-model circularity.

Control 1: Raw frequency baseline. Replacing $\textsc{Preempt}(v)$ with raw co-occurrence $f(v, \text{Cx}_{\text{conv}})$ yields $R^{2} = 0.41$ (vs. $R^{2} = 0.68$; $\Delta\text{AIC} = 34.2$, $p < .001$).

Control 2: N-gram baseline. A 5-gram model (KenLM) shows weaker preemption ($d = 0.83$ vs. $d = 1.91$) and lower human correlation ($r = 0.41$ vs. $r = 0.79$), demonstrating that transformer LLMs capture preemption-relevant information beyond surface co-occurrence.

Control 3: Primacy of human data. The human correlations ($r = 0.79$) and non-circular partial correlations (§[5.4](https://arxiv.org/html/2605.23039#S5.SS4)) constitute the theoretically primary evidence: even if every claim about LLM internals were set aside, the corpus–human partial correlation in §[5.4](https://arxiv.org/html/2605.23039#S5.SS4) would on its own demonstrate the preemption–entrenchment dissociation in human data, and the corpus–model regression is supplementary to it.

## Appendix E A Priori Classification Validation

Preemption categories were assigned based on published classifications from Levin ([1993](https://arxiv.org/html/2605.23039#bib.bib40)) and corpus thresholds from the British National Corpus (independent of model training data). Classifications were finalized before any $\Delta S$ values were computed. BNC-based classifications agree with Dolma-based classifications for 116/120 verbs (96.7%; Cohen’s $\kappa = 0.94$).
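Cohen's $\kappa$ corrects the raw agreement rate for the agreement expected by chance given each labeling's category frequencies. The following is a minimal sketch of that computation, assuming two parallel lists of category labels; the function name `cohens_kappa` and the four-verb toy labels are illustrative only and are not the classifications used in the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two labelings of the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label proportions, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both labelings use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy, hypothetical labels (S = Strong, W = Weak, N = None).
bnc_labels = ["S", "S", "W", "N"]
dolma_labels = ["S", "S", "W", "W"]
kappa = cohens_kappa(bnc_labels, dolma_labels)
```

In the toy case the two labelings agree on three of four items (0.75), chance agreement is 0.375, and the resulting $\kappa$ is 0.6.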

## Appendix F Confound Analysis for Experiment 2

Table 9: Confound matching. No significant differences (all $p > .20$; independent-samples $t$-tests).

## Appendix G Corpus Frequency Extraction

This appendix expands on the brief summary of the corpus\-parsing pipeline in §[3\.4](https://arxiv.org/html/2605.23039#S3.SS4)\. Because the preemption and entrenchment scores depend entirely on accurate construction labels, and because web\-scale corpora are noisy, we provide here the dependency\-pattern templates, filtering steps, manual\-validation methodology, and observed precision/recall for each of the three construction types\.

### G.1 Pipeline Overview

For OLMo, verb–construction co-occurrence frequencies were extracted from Dolma; for all other models, the Pile was used as a proxy ($r = 0.94$ between Dolma- and Pile-derived preemption scores across the 120 stimulus verbs, $p < .001$). The pipeline proceeds in four stages, applied identically to both corpora:

1. Sentence selection. All sentences containing a lemmatized form of each target verb are extracted using spaCy’s en\_core\_web\_trf model.
2. Dependency parsing and pattern matching. Each candidate sentence is parsed; construction-specific templates (below) are applied to assign one of: conv, unconv, or reject (ambiguous/non-matching).
3. Three-layer filtering. (a) POS-tag agreement check: matrix verb must carry verbal POS; (b) dependency-pattern strict match; (c) whitelist of construction-defining preposition lemmas (e.g., *to/for* for prepositional datives; *onto/into* vs. *with* for content- vs. container-locatives).
4. Aggregation. Per-verb counts are summed across the corpus to produce $f(v, \text{Cx}_{\text{conv}})$ and $f(v, \text{Cx}_{\text{unconv}})$.

### G.2 Construction-Specific Templates

#### Dative.

*Prepositional dative* (PD): V + dobj (theme) + prep [*to*/*for*] + pobj (recipient/beneficiary). *Double-object* (DOD): V + iobj/dative (recipient) + dobj (theme), or V + first-NP (recipient, animate) + second-NP (theme). Worked example (PD): For “*She donated the books to the library*,” spaCy produces *donated* → ROOT; *books* → dobj; *to* → prep; *library* → pobj. The pattern V + dobj + prep [*to*] + animate pobj classifies as PD. Worked example (DOD): “*She gave the library the books*” yields *gave* → ROOT; *library* → dative; *books* → dobj (or, when spaCy underspecifies, two adjacent post-verbal NPs with the first being animate and the second inanimate). Animacy is assigned using WordNet supersense tags via spaCy’s noun-classification extension.
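The two label-based dative templates can be illustrated with a small matcher over already-parsed tokens. This is a minimal sketch rather than the pipeline's actual code: the `Token` tuple, `classify_dative`, and the hand-written parses are hypothetical, and the animacy check and the adjacent-NP fallback described above are omitted.

```python
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    dep: str   # dependency label, e.g. "dobj", "prep", "pobj", "dative"
    head: int  # index of the head token within the sentence

PD_PREPOSITIONS = {"to", "for"}  # preposition whitelist for prepositional datives

def classify_dative(tokens: list[Token], verb_idx: int) -> str:
    """Return "PD", "DOD", or "reject" for the clause headed by tokens[verb_idx]."""
    children = [i for i, t in enumerate(tokens) if t.head == verb_idx and i != verb_idx]
    # Both templates require a direct-object theme attached to the verb.
    if not any(tokens[i].dep == "dobj" for i in children):
        return "reject"
    # PD: whitelisted preposition attached to the verb, governing a pobj.
    for i in children:
        if tokens[i].dep == "prep" and tokens[i].text.lower() in PD_PREPOSITIONS:
            if any(t.dep == "pobj" and t.head == i for t in tokens):
                return "PD"
    # DOD: recipient attached directly to the verb as iobj/dative.
    if any(tokens[i].dep in {"iobj", "dative"} for i in children):
        return "DOD"
    return "reject"

# Hand-written parses mirroring the two worked examples (not real parser output).
pd_sentence = [
    Token("She", "nsubj", 1), Token("donated", "ROOT", 1), Token("the", "det", 3),
    Token("books", "dobj", 1), Token("to", "prep", 1), Token("the", "det", 6),
    Token("library", "pobj", 4),
]
dod_sentence = [
    Token("She", "nsubj", 1), Token("gave", "ROOT", 1), Token("the", "det", 3),
    Token("library", "dative", 1), Token("the", "det", 5), Token("books", "dobj", 1),
]
print(classify_dative(pd_sentence, 1), classify_dative(dod_sentence, 1))
```

Checking the prepositional template first keeps a clause with both a dobj and a *to*-phrase from being counted as a double-object use; that ordering is a design choice of this sketch.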

#### Causative.

*Transitive causative*: V (verb of motion/change-of-state) + dobj (theme), where the theme is the entity undergoing the change (e.g., “*The wind broke the window*”). *Intransitive (anti-causative)*: V with subject as theme and no dobj (e.g., “*The window broke*”). Periphrastic causatives (*made the window break*) are detected via *make*/*cause*/*have* as matrix verb with the target verb as an xcomp/ccomp dependent, and contribute to a separate frame that we exclude from $f(v, \text{Cx}_{\text{unconv}})$ counts (we tested an inclusion variant; results are qualitatively identical, $r = 0.97$ in preemption scores). For verbs with manner-of-motion or sound emission (e.g., *laugh*, *sneeze*), the existence of a transitive use is by itself the diagnostic feature: *\*The clown sneezed the boy* is essentially unattested, which the parser captures as a low $f(v, \text{Cx}_{\text{trans}})$ count.

#### Locative.

*Content-oriented* (theme-as-object): V + dobj (theme) + prep [*onto*/*into*/*on*/*in*] + pobj (goal). E.g., “*She poured water into the glass*”: *water* → dobj, *glass* → pobj of *into*. *Container-oriented* (goal-as-object): V + dobj (goal) + prep [*with*] + pobj (theme). E.g., “*She filled the glass with water*.” Because locatives admit a third frame (drop-theme or drop-goal: “*She poured water*”), these single-argument instances are excluded from both numerator and denominator of $\textsc{Preempt}(v)$.

### G.3 Handling Noisy Web Text

We applied four noise-mitigation strategies:

1. Boilerplate filtering. Sentences from Common Crawl boilerplate (cookie notices, navigation text, repeated headers) were detected via Dolma’s quality filter and removed before parsing.
2. Length filtering. Sentences shorter than 4 tokens or longer than 60 tokens were excluded (the first risk fragmented parses, the second long-distance dependencies the parser handles poorly).
3. POS consistency. Sentences in which the target verb’s tag conflicted with the lemma’s expected tag (e.g., *drive* tagged as NN rather than VB) were rejected.
4. Parser-confidence threshold. For each candidate construction match, we required the relevant dependency edge to have a parser confidence (as estimated by ensembling 5 parses with stochastic dropout) above 0.75. Low-confidence matches were rejected.

The combined effect of these filters is to reduce the candidate sentence pool by approximately 30–45%, depending on construction.
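The length, POS-consistency, and confidence filters above can be expressed as a single predicate. The following is a minimal sketch under stated assumptions: `keep_candidate` and its arguments are hypothetical names, verbal tags are assumed to be Penn Treebank tags beginning with "VB", the edge confidence is assumed to have been computed upstream, and the boilerplate step is omitted because it depends on Dolma's own quality filter.

```python
MIN_TOKENS = 4               # sentences shorter than this are excluded
MAX_TOKENS = 60              # sentences longer than this are excluded
CONFIDENCE_THRESHOLD = 0.75  # the dependency edge must score above this

def keep_candidate(tokens: list[str], verb_tag: str, edge_confidence: float) -> bool:
    """Apply the length, POS-consistency, and parser-confidence filters."""
    if len(tokens) < MIN_TOKENS or len(tokens) > MAX_TOKENS:
        return False
    # Assumption: a verbal reading carries a Penn Treebank tag starting with "VB".
    if not verb_tag.startswith("VB"):
        return False
    return edge_confidence > CONFIDENCE_THRESHOLD

sentence = ["She", "donated", "the", "books", "to", "the", "library"]
print(keep_candidate(sentence, "VBD", 0.91))  # passes all three filters
print(keep_candidate(sentence, "NN", 0.91))   # rejected: nominal tag on the target verb
```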

### G.4 Validation Method and Precision

For each construction type, we manually validated parser output by hand-annotating 500 randomly sampled sentences (1,500 sentences total), drawn proportionally from the constructions retained after filtering. Annotators were two computational linguistics graduate students blind to the preemption hypothesis; disagreements were adjudicated by the senior author. Annotators classified each sentence into one of {conv, unconv, reject}. Verification examples were sampled stratified by verb (to avoid over-sampling high-frequency verbs) and by parser confidence (50% from the high-confidence quartile, 50% from the bottom three quartiles, to surface error modes).

Inter-annotator agreement before adjudication was high: Cohen’s $\kappa = 0.94$ for datives, $\kappa = 0.91$ for causatives, $\kappa = 0.89$ for locatives. Pipeline precision (agreement between pipeline label and adjudicated gold label) was 96% (dative), 93% (causative), and 92% (locative).³ Recall is more difficult to estimate without exhaustive annotation, but a sample of 200 sentences containing target verbs where the pipeline assigned reject confirmed that 87% of rejections were genuine non-matches; the remaining 13% were predominantly parser errors on long or coordinated sentences.

³ Dative precision rounded from 96.2%, the value reported in the submission’s original validation; causative and locative figures reflect smaller validation samples added during camera-ready revision.
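For reference, the two validation statistics reported above can be computed as follows. This is a minimal sketch on toy labels, not the annotation data; `cohens_kappa` and `pipeline_precision` are illustrative names, and precision follows the definition given in the text (agreement between pipeline label and adjudicated gold label).

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labelled independently at their own base rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def pipeline_precision(pipeline: list[str], gold: list[str]) -> float:
    """Share of sentences where the pipeline label matches the adjudicated gold label."""
    return sum(p == g for p, g in zip(pipeline, gold)) / len(gold)

# Toy labels for illustration only.
annotator_1 = ["conv", "conv", "unconv", "unconv"]
annotator_2 = ["conv", "unconv", "unconv", "unconv"]
print(cohens_kappa(annotator_1, annotator_2))
print(pipeline_precision(["conv", "reject", "unconv", "conv"],
                         ["conv", "reject", "unconv", "unconv"]))
```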

### G.5 Sensitivity to Pipeline Choices

To verify that our results are not artifacts of specific pipeline thresholds, we re-ran preemption-score computation under three perturbations: (a) raising the parser-confidence threshold (0.75 → 0.90); (b) halving it (0.75 → 0.375); (c) replacing strict dependency matching with a more permissive POS-pattern matcher. Across all three perturbations, per-verb preemption scores correlate with the production-pipeline scores at $r \geq 0.93$ (dative), $r \geq 0.89$ (causative), $r \geq 0.85$ (locative), and the human-correlation results from §[4](https://arxiv.org/html/2605.23039#S4) reproduce within $\pm 0.04$ of the reported values. We conclude that the qualitative pattern is robust to reasonable choices in the parsing pipeline.

## Appendix H Controlled Intervention Details

### H.1 Fine-Tuning Data Construction

For each of the 10 target verbs in each condition, we generated 500 sentences using templates matched for length (8–12 words), subject variety (20 unique subjects), and object variety (20 unique themes/recipients). Template naturalness was validated: mean perplexity under GPT-2 Medium (not the target model) was 22.1 (SD = 4.3), comparable to natural text (Wikitext-103 mean: 18.9).

### H.2 Fine-Tuning Hyperparameters

Model: GPT-2 124M (base). Learning rate: $5 \times 10^{-5}$ with linear warmup over 100 steps. Batch size: 16. Epochs: 3. Weight decay: 0.01. Random seeds: {42, 123, 456, 789, 1024}.

### H.3 Variance Across Seeds

Table 10: $\Delta\Delta S$ across 5 random seeds (*Ampl.* = Amplified, *Atten.* = Attenuated, *Rev.* = Reverse, *Ctrl.* = Control). All directional predictions hold in every seed.

### H.4 Verification

Post-fine-tuning perplexity on Wikitext-103 increased by <5% in all conditions. The Amplified model showed increased preference for the prepositional dative for target verbs ($p < .001$), while the Attenuated model showed equalized preferences ($p = .72$ for difference from chance).

## Appendix I Scaling Law Sensitivity Analysis

Table 11: Comparison of scaling law functional forms.

## Appendix J Individual Verb Analysis

Table 12: Per-verb $\Delta S$ and DAIS bias scores (LLaMA-2 7B). DAIS = proportion preferring the prepositional dative.

## Appendix K BLiMP Comparison

BLiMP (Warstadt et al., [2020](https://arxiv.org/html/2605.23039#bib.bib10)) includes argument structure items on which LMs achieve >90% accuracy. Our study differs in three ways: (1) we test gradient rather than binary acceptability; (2) we test the *mechanism* (preemption vs. entrenchment) rather than mere knowledge; and (3) we evaluate per-verb item-level correlations rather than aggregate accuracy. Hu et al. ([2024](https://arxiv.org/html/2605.23039#bib.bib69)) also included gradient constructions; our extension is the *dissociation* of preemption from entrenchment, which Hu et al. did not test.

## Appendix L Computing Infrastructure

All experiments were conducted on NVIDIA A100 80GB GPUs. Surprisal extraction for the largest model (LLaMA-2 70B) required approximately 8 GPU-hours across all 120 verb items × 5 sentence frames × 2 constructions. Smaller models (GPT-2, Pythia ≤1B) completed in under 1 GPU-hour. Fine-tuning (Experiment 4) used a single A100 and completed in approximately 45 minutes per condition per seed, totaling approximately 15 GPU-hours across all conditions and seeds.
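The compute figures above can be cross-checked with simple arithmetic. The sketch below assumes four fine-tuning conditions (the Amplified, Attenuated, Reverse, and Control conditions named in the Table 10 caption); that count is inferred from the caption rather than stated in this appendix.

```python
# Fine-tuning budget: conditions x seeds x minutes per run.
n_conditions = 4        # assumption: Amplified, Attenuated, Reverse, Control
n_seeds = 5             # seeds listed in H.2
minutes_per_run = 45    # approximate, per condition per seed
total_gpu_hours = n_conditions * n_seeds * minutes_per_run / 60

# Surprisal extraction: items x frames x constructions.
n_scored_sentences = 120 * 5 * 2

print(total_gpu_hours, n_scored_sentences)
```

Under that assumption the product matches the reported total of approximately 15 GPU-hours.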

## Appendix M Extended Error Analysis

We analyze systematic discrepancies between LLM predictions and human judgments to identify where the preemption account succeeds and where it falls short.

### M.1 Verbs Where LLMs Overestimate Preemption

Several low-frequency verbs receive $\Delta S > 1.5$ despite near-chance DAIS bias scores (indicating humans find both constructions acceptable). Table [13](https://arxiv.org/html/2605.23039#A13.T13) identifies the primary outliers.

Table 13: Verbs where LLM $\Delta S$ substantially exceeds the value predicted by DAIS human ratings. Residuals computed from the linear regression $\Delta S \sim \text{DAIS}$.

The common thread is that these verbs have skewed corpus distributions for reasons other than functional competition: *cable* and *telegraph* are register-restricted (formal/archaic), while *wire* exhibits polysemy (transfer-of-information vs. physical wire). When these four verbs are excluded, the LLM–human correlation increases from $r = 0.79$ to $r = 0.83$, and the preemption–entrenchment dissociation strengthens.
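The residual-based outlier identification referred to in the Table 13 caption can be sketched as an ordinary least-squares fit of $\Delta S$ on DAIS followed by ranking the residuals. The function name `ols_residuals` and the five data points are illustrative assumptions, not values from the paper.

```python
def ols_residuals(x: list[float], y: list[float]) -> list[float]:
    """Residuals of the least-squares line y ~ intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy values for illustration only: the third item has an inflated delta-S.
dais = [0.5, 0.6, 0.7, 0.8, 0.9]
delta_s = [0.2, 0.5, 2.4, 1.1, 1.4]
residuals = ols_residuals(dais, delta_s)
worst = max(range(len(residuals)), key=lambda i: residuals[i])
print(worst, round(residuals[worst], 2))
```

A large positive residual marks an item whose model surprisal difference exceeds what the human ratings predict, which is the pattern this subsection describes.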

### M.2 Verbs Where LLMs Underestimate Preemption

A smaller set of verbs show the reverse pattern: humans strongly prefer one construction, but LLMs assign relatively balanced surprisal:

Table 14: Verbs where human DAIS bias exceeds LLM-predicted preemption.

These verbs may carry pragmatic or semantic factors (e.g., *guarantee* implies formal commitment; *teach* has strong idiomatic preferences) that influence human judgments beyond pure distributional competition, consistent with the formal–functional divide discussed in §[8.2](https://arxiv.org/html/2605.23039#S8.SS2).

### M.3 Construction-Level Error Patterns

Across construction types, we observe a systematic trend: LLM predictions are most accurate for the dative alternation (RMSE = 0.31), intermediate for causative (RMSE = 0.38), and weakest for locative (RMSE = 0.47). This ordering mirrors the strength of preemption effects in human data (Ambridge et al., [2008](https://arxiv.org/html/2605.23039#bib.bib64)) and may reflect the relative frequency and regularity of these alternations in training corpora.

## Appendix N Cross-Linguistic Predictions and Testable Hypotheses

While our study is restricted to English (§[8.3](https://arxiv.org/html/2605.23039#S8.SS3)), the theoretical framework generates specific, falsifiable predictions for other languages. We outline these to facilitate future cross-linguistic testing.

### N.1 Agglutinative Languages (e.g., Turkish, Finnish)

In agglutinative languages, the relevant “constructions” may be morphological rather than syntactic. For Turkish:

- The causative suffix *-tIr* exhibits partial productivity: some verbs resist causativization despite semantic compatibility (Ambridge, [2020](https://arxiv.org/html/2605.23039#bib.bib27)).
- Prediction: LLMs trained on Turkish should show higher surprisal for the causative forms of verbs that have conventional periphrastic alternatives (e.g., *ettirmek* “to cause to do” rather than *-tIr*).
- Test: Extract preemption scores from Turkish corpora and correlate with LLM surprisal differentials, as in our Experiment 1.

### N.2 Isolating Languages (e.g., Mandarin Chinese)

In Mandarin, argument structure alternations take different forms:

- The *bǎ*-construction vs. the canonical SVO order provides a partial analogue to the dative alternation.
- Prediction: Verbs that strongly prefer *bǎ* in relevant communicative contexts should resist the SVO alternative in LLM surprisal patterns.
- Challenge: Functional competition may be harder to operationalize because the constructions differ in information structure (topic/focus) rather than purely syntactic alternation.

### N.3 Free Word-Order Languages (e.g., Russian, German)

In languages with freer word order:

- Argument structure alternations may interact with case marking and word-order preferences.
- Prediction: Preemption effects should be detectable in case-frame alternations: a verb that conventionally takes the dative case should show higher surprisal when used with the accusative.
- Infrastructure: The Wilcox et al. 11-language surprisal dataset (Wilcox et al., [2023](https://arxiv.org/html/2605.23039#bib.bib2)) provides ready-made data for German; WALS (Dryer and Haspelmath, [2013](https://arxiv.org/html/2605.23039#bib.bib74)) and Grambank (Skirgård et al., [2023](https://arxiv.org/html/2605.23039#bib.bib55)) features can guide language selection.

### N.4 Prioritized Language Sample

Based on typological diversity and resource availability, we recommend initial testing on: Turkish (agglutinative), Mandarin (isolating), German (fusional, V2), Finnish (agglutinative, morphologically rich), and Japanese (SOV, case-marking). This sample spans four of the six major morphological types in WALS and three distinct word-order families.

## Appendix O Relationship to Developmental Models

Our findings connect to the BabyLM Challenge (Warstadt et al., [2023](https://arxiv.org/html/2605.23039#bib.bib72)), which evaluates language models trained on developmentally plausible data (10M–100M words), and to a broader tradition of computational models of acquisition, including gradient parameter-setting models (Howitt et al., [2021](https://arxiv.org/html/2605.23039#bib.bib77)) and computational accounts of staged syntactic development such as the Null Subject stage (Dey and Sakas, [2025](https://arxiv.org/html/2605.23039#bib.bib78)).

### O.1 Preemption as an Evaluation Metric

We propose that preemption sensitivity should be included as a standard evaluation metric for cognitively motivated language models. Specifically:

- Metric: Pearson $r$ between model $\Delta S$ and DAIS human ratings across the 80 dative verbs.
- Baseline: Our GPT-2 124M achieves $r = 0.61$; a BabyLM model trained on 100M words should achieve a similar or lower value given reduced data.
- Scaling prediction: Our power-law fit (Eq. [4](https://arxiv.org/html/2605.23039#S6.E4)) predicts that a model with ∼100M tokens should achieve $r \approx 0.50$–$0.55$, depending on corpus composition.
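The proposed metric reduces to a single item-level Pearson correlation. A minimal sketch follows, assuming per-verb $\Delta S$ values and DAIS ratings are available as aligned lists; `pearson_r` is an illustrative helper and the numbers are toy values, not the paper's data.

```python
def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Toy per-verb values for illustration only.
dais_ratings = [0.52, 0.61, 0.74, 0.83, 0.95]
model_delta_s = [0.10, 0.45, 0.40, 1.20, 1.35]
preemption_sensitivity = pearson_r(model_delta_s, dais_ratings)
print(round(preemption_sensitivity, 3))
```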

### O.2 Connection to Tachihara & Goldberg’s Human Evidence

Tachihara and Goldberg ([2025](https://arxiv.org/html/2605.23039#bib.bib61)) demonstrated that human learners acquire preemption through exposure to conventional formulations: the first causal evidence of this kind in humans. Our Experiment 4 provides a computational parallel: fine-tuning with increased conventional-form frequency causes increased preemption behavior. The convergence of human and computational causal evidence strengthens the theoretical claim that preemption arises from distributional learning.

## Appendix P Additional Robustness Analyses

### P.1 Bootstrap Stability of Correlations

To verify the stability of our primary result ($r = 0.79$), we computed the bootstrap distribution of the correlation across 10,000 resamples of the 80 dative verbs. The distribution is approximately normal (Shapiro-Wilk $W = 0.998$, $p = .43$), with the 95% CI of [0.69, 0.86] derived from the 2.5th and 97.5th percentiles. The probability of $r < 0.50$ in any bootstrap sample is <0.001, indicating that the strong correlation is robust to item sampling variability.
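The percentile bootstrap described here can be sketched as follows. This is a minimal illustration under stated assumptions: items (verbs) are resampled with replacement, the interval uses nearest-rank percentiles, and the 80 data points are synthetic stand-ins generated in the code, not the paper's $\Delta S$ or DAIS values. The names `pearson_r` and `bootstrap_ci` are hypothetical.

```python
import random

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_ci(x, y, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic stand-in for 80 (DAIS, delta-S) pairs; NOT the paper's data.
data_rng = random.Random(1)
dais = [data_rng.random() for _ in range(80)]
delta_s = [2.0 * d + data_rng.gauss(0, 0.4) for d in dais]
ci_low, ci_high = bootstrap_ci(dais, delta_s, n_resamples=2000)
print(ci_low, ci_high)
```

The example call uses 2,000 resamples only to keep the run short; the default matches the 10,000 resamples reported above.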

### P.2 Leave-One-Model-Out Analysis

To test whether any single model drives the scaling law fit, we performed leave-one-out cross-validation across the 6 Pythia models. The power-law exponent $b$ is stable across jackknife samples: mean $b = 0.091$ (SD $= 0.008$, range [0.079, 0.102]). No single model removal changes the qualitative result.
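A leave-one-model-out check of this kind can be sketched by refitting the exponent with each model removed in turn. The sketch assumes a pure power law $y = a N^{b}$ fitted by least squares in log-log space; the paper's Eq. 4 may take a different form. The parameter counts and sensitivity values below are synthetic and illustrative, and `fit_power_law` and `jackknife_exponents` are hypothetical names.

```python
import math

def fit_power_law(sizes, values):
    """Least-squares fit of log(value) = log(a) + b * log(size); returns (a, b)."""
    log_x = [math.log(s) for s in sizes]
    log_y = [math.log(v) for v in values]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    b = (sum((u - mean_x) * (w - mean_y) for u, w in zip(log_x, log_y))
         / sum((u - mean_x) ** 2 for u in log_x))
    a = math.exp(mean_y - b * mean_x)
    return a, b

def jackknife_exponents(sizes, values):
    """Refit the exponent once per model, leaving that model out."""
    exponents = []
    for i in range(len(sizes)):
        _, b = fit_power_law(sizes[:i] + sizes[i + 1:], values[:i] + values[i + 1:])
        exponents.append(b)
    return exponents

# Illustrative parameter counts and synthetic sensitivities (exact power law, b = 0.09).
sizes = [70e6, 160e6, 410e6, 1.0e9, 2.8e9, 6.9e9]
values = [0.1 * s ** 0.09 for s in sizes]
exponents = jackknife_exponents(sizes, values)
mean_b = sum(exponents) / len(exponents)
print(mean_b, min(exponents), max(exponents))
```

Because the synthetic values follow an exact power law, every refit recovers the same exponent; with real measurements the spread across refits is the quantity of interest.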

### P.3 Sensitivity to Sentence Frame Selection

Each verb was tested with 5 sentence frames. To verify that results are not driven by particular frames, we computed correlations using each individual frame separately. Frame-level correlations with DAIS range from $r = 0.73$ to $r = 0.82$ (LLaMA-2 7B), with no frame producing a qualitatively different pattern. The 5-frame average ($r = 0.79$) is near the center of this range.

### P.4 Alternative Preemption Score Formulations

We tested two alternative operationalizations of the preemption score:

1. Log-odds: $\textsc{Preempt}_{\text{log}}(v) = \log \frac{f(v, \text{Cx}_{\text{conv}}) + 1}{f(v, \text{Cx}_{\text{unconv}}) + 1}$. This yields $r = 0.77$ with DAIS (vs. $r = 0.79$ for our Laplace-smoothed proportion), indicating that the specific functional form matters little.
2. Conditional probability: $\textsc{Preempt}_{\text{cond}}(v) = \frac{f(v, \text{Cx}_{\text{conv}})}{f(v)}$, omitting the unconventional form entirely. This yields $r = 0.74$, slightly lower, suggesting that the ratio formulation (which captures *relative* competition) is more informative than the simple proportion.
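The two alternative scores translate directly into code. The sketch below implements the formulas as written above; the function names and the example counts are illustrative, and natural logarithm is assumed since the base is not specified.

```python
import math

def preempt_log(f_conv: int, f_unconv: int) -> float:
    """Add-one smoothed log-odds of conventional vs. unconventional counts."""
    return math.log((f_conv + 1) / (f_unconv + 1))

def preempt_cond(f_conv: int, f_verb: int) -> float:
    """Share of all uses of the verb that occur in the conventional construction."""
    return f_conv / f_verb

# Illustrative counts only: 900 conventional and 9 unconventional uses
# out of 1,200 total occurrences of the verb.
print(preempt_log(900, 9))
print(preempt_cond(900, 1200))
```

The log-odds score is zero when the two constructions are equally frequent and grows as the conventional form dominates, whereas the conditional score ignores the unconventional count altogether.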

## Appendix Q Detailed Comparison with Yao et al. (2025)

Yao et al. ([2025](https://arxiv.org/html/2605.23039#bib.bib71)) conducted a controlled-rearing study showing that dative preferences in LMs are shaped by indirect statistical patterns. Our study extends their work in five specific ways:

1. Preemption–entrenchment dissociation: Yao et al. did not test whether the observed effects are driven by competing-form frequency (preemption) or overall verb frequency (entrenchment). Our Experiment 2 provides this dissociation.
2. Human behavioral ground truth: Yao et al. did not correlate model behavior with human acceptability data. Our item-level correlations with DAIS, R&G, and T&G provide independent validation.
3. Non-circular validation: Both our study and Yao et al.’s involve corpus-model comparisons. We address the resulting circularity concern with non-circular partial correlations against human data (§[5.4](https://arxiv.org/html/2605.23039#S5.SS4)).
4. Reverse-direction control: Our Experiment 4 includes a reverse-direction condition that Yao et al.’s design does not, addressing the tautology concern about frequency manipulation in frequency-sensitive models.
5. Multi-construction scope: Yao et al. focused exclusively on the dative alternation; we extend to causative and locative constructions.
