Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

arXiv cs.CL Papers

Summary

This paper investigates whether Large Language Models exhibit the same usage-based linguistic productivity constraints (entrenchment and preemption) as humans, finding that models can reproduce coercion but fail to apply statistical preemption to avoid overgeneralization.

arXiv:2606.02953v1 Announce Type: new Abstract: Usage-based theories of grammars posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether or not the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous, but never observed in data.
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:36 AM

# Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt
Source: [https://arxiv.org/html/2606.02953](https://arxiv.org/html/2606.02953)
Claire Bonial1,Claire Benet Post2, Laura Michaelis2,Harish Tayyar Madabushi3 1Georgetown University, 2University of Colorado Boulder,3University of Bath Correspondence:[claire\.bonial@georgetown\.edu](https://arxiv.org/html/2606.02953v1/mailto:[email protected]),[htm43@bath\.ac\.uk](https://arxiv.org/html/2606.02953v1/mailto:[email protected])

###### Abstract

Usage\-based theories of grammars posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear\. Large Language Models are also usage\-based, in the sense that the structures of language are learned through exposure to vast amounts of text\. Here, we test whether or not the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs\. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity \(entrenchment\) in cases ofcoercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item\. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous, but never observed in data\.

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

Claire Bonial1, Claire Benet Post2,Laura Michaelis2,Harish Tayyar Madabushi31Georgetown University,2University of Colorado Boulder,3University of BathCorrespondence:[claire\.bonial@georgetown\.edu](https://arxiv.org/html/2606.02953v1/mailto:[email protected]),[htm43@bath\.ac\.uk](https://arxiv.org/html/2606.02953v1/mailto:[email protected])

## 1Introduction

There is increasing research demonstrating the parallels between building a model of language based on processing vast amounts of text and how people build a grammar of natural language through language usage in interaction \(e\.g\.,Tomasello \([2006](https://arxiv.org/html/2606.02953#bib.bib232),[2009](https://arxiv.org/html/2606.02953#bib.bib78)\)\)\. The success of Large Language Models \(LLMs\) in a variety of tasks can be understood to signal the validity of the assumption that exposure to the patterns of language in the form of text alone is sufficient to build a functional model of language in which one need not posit grammatical rules apriori\. Usage\-based grammatical theories, such as Construction Grammar \(CxG\) \(e\.g\.,Goldberg \([1995](https://arxiv.org/html/2606.02953#bib.bib39)\); Croft \([2001](https://arxiv.org/html/2606.02953#bib.bib40)\); Bybee \([2010](https://arxiv.org/html/2606.02953#bib.bib107)\)\), similarly posit that language usage engenders natural language learning such that frequency of exposure supports entrenchment and generalization of the structures of a languageTomasello \([2003](https://arxiv.org/html/2606.02953#bib.bib54)\)\. Under many of such theories, the usage\-based development of grammar eschews the need to posit syntactic rules: instead of learning and memorizing lexical items and a limited number of syntactic rules guiding their grammatical combination, speakers acquire and learn un\-analyzedholophrasesof language that they are most frequently exposed toTomaselloet al\.\([2008](https://arxiv.org/html/2606.02953#bib.bib428)\)\. Over the course of acquisition, speakers extend the patterns of highly frequent phrases of their language productively to novel and creative combinations\.

Therefore, frequency of exposure to a particular construct \(token of a particular construction type\) of language, it is theorized, plays an important role in the development of grammarHoffmann \([2022](https://arxiv.org/html/2606.02953#bib.bib234)\)\. While frequency contributes to entrenchment of well\-established patterns,statistical preemptionhas a constraining effect on productivity in grammarGoldberg \([2019](https://arxiv.org/html/2606.02953#bib.bib414)\)\. Specifically, statistical preemption accounts for why speakers do not consistently overgeneralize the\-edpast tense ending to irregular verbs: where the linguistic context would cue“go\-ed”, speakers instead consistently hear“went”\. Thus, speakers attend to negative evidence \(what they do not hear in a particular linguistic context\) in developing a grammar\.

Although the role of frequency is largely agreed upon and well\-supported in psycholinguistic literature, Construction Grammar offers little to model or predict the productive patterns of language \(outside of Sign\-Based Construction GrammarMichaelis \([2013](https://arxiv.org/html/2606.02953#bib.bib237)\)\)\. LLMs offer an unprecedented opportunity to study these forces in vast usage\-based models of language\. Here, we explore the parallels between human and LLM processing by focusing specifically on the phenomena ofcoercionandstatistical preemptionin order to test if models similarly leverage both positive and negative evidence in linguistic generalization and avoiding overgeneralization\.*Our experiments show that models acquire constructional categories but do not use them as preemption signals\.*

After introducing the theoretical framework \(§[2](https://arxiv.org/html/2606.02953#S2)\), we summarize relevant linguistic and psycholinguistic research on the role of frequency in linguistic productivity \(§[3](https://arxiv.org/html/2606.02953#S3), §[4](https://arxiv.org/html/2606.02953#S4)\)\. We then present two experiments\. The firstCoercionexperiment tests model ability to interpret semantics of coercive constructions \(e\.g\.,I drank the bottle\) wherein the construal of a lexical item is coerced by the broader linguistic context \(e\.g\., container construed as contents\) \(§[5\.1](https://arxiv.org/html/2606.02953#S5.SS1)\)\. Coercion is one way in which speakers productively extend the patterns of language to novel, creative usages\. The secondStatistical Preemptionexperiment tests model ability to constrain productivity by attending to negative evidence: there is a linguistic context in which a structure is semantically viable, and yet this structure never occurs \(e\.g\.,?The asleep cat purred peacefully; ?The teacher explained me the answer\)111We precede utterances that may be questionably acceptable or unacceptable to most English speakers with ‘?’\.\(§[5\.2](https://arxiv.org/html/2606.02953#S5.SS2)\)\. Thus, statistical preemption curbs productivity and overgeneralization\. Critically, our methodology consistently leverages both familiar \(English\) language and the use of nonce words in order to test for true generalization abilities of models, separate from memorized features of language\. We summarize our results \(§[6](https://arxiv.org/html/2606.02953#S6), §[7](https://arxiv.org/html/2606.02953#S7)\) showing that while models learn from positive evidence to interpret coercive constructions, they do not deploy negative evidence in order to preempt certain structures and constrain complete productivity\.

## 2Usage\-Based Theoretical Framework

In contrast to Generative Grammar theories that posit speakers only memorize minimal information \(lexical items and their meanings, a small, re\-usable set of syntactic rules guiding the grammatical combination of those items\), usage\-based theories posit that speakers store, classify, and cluster a tremendous amount of information with each token of linguistic experienceBybee \([2010](https://arxiv.org/html/2606.02953#bib.bib107)\)\. Under a Construction Grammar approach, these tokens of linguistic experience are form\-meaning pairings that may be words, phrases, or even sub\-parts of words, known asconstructions\. The form pole of a construction is the phonological information which may be fixed or variable\. For example, the Comparative\-correlative construction involves the fixed phonological form “the” combined with two juxtaposed flexible, schematic slots:The higher you fly, the harder you fall\.The meaning pole is the conceptual knowledge of the referent, which will vary depending upon the speaker\. The meaning pole in CxG is enriched by the various social and pragmatic information that speakers store when they encounter a particular lexical item\.

As a result of storing an array of contextual information with each linguistic exposure, speakers build a grammar beginning with frequently heard linguistic forms associated with frequent experiences\. For example, a child learns to associate saying“Mommy”with the experience of their parent appearing\. With further experience, speakers recognize the commonalities between particular linguistic contexts and types of words filling particular slots, such that generalization and information akin to part\-of\-speech arises\. For example, children may first learn to extend the patternWant Xfirst to more of a certain food or drink, then to more of a desirable item such as a toy\. The level of abstraction over the learned constructions increases, as speakers first recognize and abstract over commonalities in the slot after“want”, then note parallels in the slot before it, then note parallels in broader phrases such that the notion of a verb, verb phrase, and clause arise \(speakers deploy this information; but speakers need not be aware of the linguistic labels or meta\-linguistic generalization itself\)\.

In this way, frequency supports the entrenchment of certain structures and patterns of language\. However, it is clear that there are many constructions of language that are not infinitely productive, and in fact there are some patterns that are semi\-productive, wherein the productivity is seemingly idiosyncratic\. For example, the English Adjective\-phrase construction is highly productive:the red ball; that big building; a fun party\.However, there is a class of English adjectives that are not productive extensions of this pattern:?the asleep cat; ?that aloft balloon; ?an ashamed look\. Notice that there is no clear semantic reason that these adjectives are incompatible with the broader English Adjective\-phrase construction; in other words, the set of a\-adjectives do not form a coherent semantic class that is conceptually incompatible with the construction\.

There is, however, a competing viable alternative for expressing that a particular thing has a particular characteristic in English:the ball is red; the cat is asleep\. These a\-adjectives are attested with some frequency within this Predicate\-adjective construction\. Thus, there is negative evidence blocking the usage of a\-adjectives in the Adjective\-phrase construction, and positive evidence supporting the usage of a\-adjectives in the Predicate\-adjective construction\. The fact that speakers do not overgeneralize the Adjective\-phrase construction demonstrates that speakers not only attend to the relative frequencies of constructs within a particular class of constructions, but also that they attend to negative evidence or what is never said\.

Therefore, under a usage\-based, CxG view, generalization of linguistic structures operates under the influence of positive evidence and negative evidence\. In the sections to follow, we delineate linguistic and psycholinguistic research on the positive generalization process of coercion and the negative generalization process of statistical preemption, respectively\. We base our LLM tests on this bedrock of linguistic research and experimentation\.

## 3Coercion

Coercionwas first introduced byPustejovsky \([1991](https://arxiv.org/html/2606.02953#bib.bib120)\), who describes “type coercion” as a formal semantic operation in which a syntactic context requires a particular semantic type, and if the filler doesn’t match that type, it gets coerced into the required type via mechanisms in the lexical entry \(under this view,qualia structure\)\. For example, the verbbeginparadigmatically calls for some type of event that unfolds over time\. Thus,begin the bookis type coercion of the concrete noun book, which is thus interpreted in this linguistic context asbeginreadingthe book\.

In this research, we adopt the definition of coercion outlined inMichaelis \([2004](https://arxiv.org/html/2606.02953#bib.bib121)\), who brings a CxG lens to coercion, and posits the Override Principle:

> “If a lexical item is semantically incompatible with its morphosyntactic context, the meaning of the lexical item conforms to the meaning of the structure in which it is embedded”\(Michaelis,[2004](https://arxiv.org/html/2606.02953#bib.bib121), p\. 25\)\.

Thus, constructional semantics will win out given a clash between a construction and a lexical item\.

WhereasMichaelis \([2004](https://arxiv.org/html/2606.02953#bib.bib121)\)focuses primarily on aspectual coercion, we focus on the coercive process that enables type shifting from container to contents, described in detail inRadden and Kövecses \([1999](https://arxiv.org/html/2606.02953#bib.bib119)\)as a common metonymic conceptual relationship\. The authors argue that speakers are more interested in the contents of a container than the container itself\. Thus, we commonly see metonymies which target the contents via the container:This is an excellent bottle\(referring to the wine within the bottle\)\.

Notice that interpretation and deployment of coercive constructions requires a fair amount of world knowledge that supports conceptual clusters such as containers and contents, but also possessor for possessed, and part\-whole metonymy\. This is consistent with a cognitive view of word meaning in which concepts are not defined according to necessary and sufficient conditions, but instead by clusters of metaphors, where “each metaphor highlights certain aspects of the concept and implicitly hides others”\(Lakoff and Johnson,[1980](https://arxiv.org/html/2606.02953#bib.bib117), p\. 201\)\. Thus, coercion facilitates an atypical, but plausible interpretation of a lexical item in a particular constructional context\. Furthermore, coercion is one process supporting the productive generalization of constructions to novel, previously unseen instantiations\.

## 4Statistical Preemption

While psycholinguistic literature has demonstrated that more frequent forms are more likely to serve as the basis for productive generalizationBybee and Eddington \([2006](https://arxiv.org/html/2606.02953#bib.bib314)\)through processes like coercion, what prevents speakers from overgeneralizing? Experimental evidence points to the effect of statistical preemption:

> An inference that speakers make from repeatedly hearing form B, when a semantically and pragmatically appropriate expression A could have been used, that B is the appropriate formulation while A is not\(Suttle and Goldberg,[2011](https://arxiv.org/html/2606.02953#bib.bib315)\)\.

This process has been experimentally demonstrated in studies showing that this is how children eschew overgeneralization of certain morphological Cxns\. As discussed earlier, consider, for example, the English past tense ending“\-ed”on high\-frequency irregular verbs: Having repeatedly heard“went”in the contexts where“go\-ed”might be expected is negative, indirect evidence that“go\-ed”is not appropriate\(Aronoff,[1976](https://arxiv.org/html/2606.02953#bib.bib399); Kiparsky,[1982](https://arxiv.org/html/2606.02953#bib.bib400)\)\.

The process of statistical preemption at the phrasal level is more complex because it is not clear what is semantically and pragmatically similar enough to preempt another phrasal form\. Indeed, there are several cross\-linguistic studies of quite clearly parallel forms, such as Czech variants of nominative declensions, that demonstrate that statistical preemption does not happen as “reliably” as one might expect \(e\.g\.,Bermel and Knittl \([2012a](https://arxiv.org/html/2606.02953#bib.bib401),[b](https://arxiv.org/html/2606.02953#bib.bib402)\)\)\. Instead, linguistic variation is maintained due to an emergent, specialized relationship between a particular minority form and context\. This makes it difficult to pinpoint when a semantically similar form may or may not be pragmatically appropriate and therefore interchangeable with a higher\-frequency form\.

Nonetheless, compelling experimental evidence for statistical preemption has come from several psycholinguistic studies, includingBoydet al\.\([2012](https://arxiv.org/html/2606.02953#bib.bib403)\)as well asTheakston \([2004](https://arxiv.org/html/2606.02953#bib.bib404)\)\.Theakston \([2004](https://arxiv.org/html/2606.02953#bib.bib404)\)concludes that entrenchment operates independently from preemption in a task where both children and adults are asked to provide grammaticality ratings for sentences with argument structure errors, involving both high and low\-frequency verbs\. She finds that frequency plays a strong role: for both children and adults, lower\-frequency verbs were found to be more acceptable than high\-frequency verbs in the erroneous argument structures\. Theakston notes that the adults would certainly be expected to have preempting structures for both high and low\-frequency verbs, yet this did not remove the frequency effect observed in participants’ judgments of grammaticality\. Thus, we can see that preemption is not an all\-or\-nothing process, speakers may remain less confident of the range of Cxns in which lower\-frequency items are used\.

Our research most directly draws upon the experimental evidence ofBoyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\), who discuss the phenomenon of blocking a\-adjectives \(e\.g\.,awake, aware, ashamed, awash\) in the pre\-nominal position of the English Adjective\-phrase construction \(e\.g\.,?The awake child\)\. In this research, speakers track the fact that although adjectives can generally sit in either the prenominal position of the English Adjective\-phrase construction \(e\.g\.,the blue bucket\) or in the post\-copula position of the Predicate\-adjective construction \(e\.g\.,the bucket is blue\), that the phonologically shaped class of English a\-adjectives is only observed in the Predicate\-adjective construction \(e\.g\.,the child is awake\)\. As a result of this*negative evidence*, speakers avoid producing even novel, nonce a\-adjectives in the pre\-nominal Adjective\-phrase construction\.

We also base our own experimentation with LLMs on corpus evidence presented inGoldberg \([2019](https://arxiv.org/html/2606.02953#bib.bib414)\), which, as the titleExplain me thishighlights, explores statistical preemption in the case of the Ditransitive \(sometimes also referred to as the Double\-object construction\) Dative constructions, also previously explored inGoldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib415)\)\. This research focuses on explaining the fact that some ‘alternating’ verbs can felicitously combine with both constructions\. For example, both the motion verbkickand the transfer verbgive:

Ditransitive: I gave/kicked him the ball\. Dative: I gave/kicked the ball to him\.

However, there are also sets of verbs that are found almost exclusively in corpora in the Dative, but not the Ditransitive variant; for example,“push”: Push the box to me; ?Push me the box\. Goldberg synthesizes other research to conclude that speakers may be able to extend evidence gleaned from statistical preemption to other members of a class, but limits this conclusion to morphosyntactic classes\. In the case of the Ditransitive, the verb must be able to encode transfer semantics\. Thus, we hypothesize that if speakers are presented with a novel verb that seems to encode only motion semantics, it is plausible that they will extend statistical preemption patterns to that novel verb, blocking realization in the Ditransitive construction\.

## 5Experimental Methodology

Here, we aim to test model abilities with respect to linguistic generalization\. We draw upon the research summarized in the previous section to set up comparative experiments in which we compare model performance in 1\) generalizing to novel coercive constructions, and 2\) avoiding overgeneralizing to statistically preempted structures\. In both experiments, testing for true generalization requires an evaluation methodology that precludes model memorization of data masking as generalization of a pattern\. Thus, in both cases, we test not only familiar, existing English words, but also novel nonce words\. This ensures that performance on the nonce test items cannot be accounted for via memorization\.

### 5\.1Experiment 1: Coercion

The coercive phenomenon that we focus on is coercive type shifting between a container and its contents\. As described previously, linguistic literature posits that a reference to the container often stands in for the contents\. Thus, upon hearingThis is an excellent bottle\!a listener may construe from the linguistic context and commonsense information that the speaker is complimenting the contents of the bottle, not the bottle itself\. Our experimental setup tests whether or not models are attuned to both the broader linguistic context and commonsense knowledge such that either model surprisal \(in base decoder models\) or forced\-choice reflects the forced\-choice acceptability judgments of native English speakers\.

Leveraging linguistic studies as the basis, we synthesize a dataset of 60 paired sentences all with familiar English words for containers and contents, and a parallel dataset of another 60 paired sentences with novel, nonce container or contents nouns\. Preceding each target coercive sentence, there are 4 context\-setting sentences that highlight the container usage of the coerced noun\. For example, both models and annotators were asked to choose either Option A or B as sounding more natural:

1. 1\.Option A: Jim rinsed the glempkit under the tap\. He placed the glempkit on the stove\. He corked the glempkit again\. He lifted the glempkit by the rim\.Then he sipped the glempkit\.
2. 2\.Option B: Jim blended the tropmir with the milk\. He squirted the tropmir from the bottle\. He dissolved the tropmir in the water\. He spread the tropmir across the dough\.Then he chipped the tropmir\.

While the Option A context\-setting sentences force an interpretation of the nonce word as some type of container, those of Option B force the interpretation of another nonce word as a substance\. The final, bold\-faced sentence provides the coercive context\. For Option A, what is clearly a container is coerced into a substance reading in the context of the verb“sip”\. For Option B, what is clearly a substance is coerced into a container reading in the context of the verb“chipped”\. In this case, Option A represents the coercion of the container as contents, and Option B represents the coercion of the contents as container \(which option is presented as A or B is randomized in testing\)\.

Across the set of 60 English paired sentences and 60 sentences with nonce nouns, we compare model and annotator choice of which option is more natural\. For base decoder models where we can measure surprisal \(through perplexity\), we consider the option with lower surprisal to be the more natural option\. We note the complementarity of having both surprisal and forced\-choice: surprisal reads lexical statistics directly; forced choice reads metalinguistic acceptability\. Together, they triangulate the construction\-level signal\.

### 5\.2Experiment 2: Statistical Preemption

We focus on statistical preemption in a\-adjectives as well as the Ditransitive/Dative alternation\. Again, we leverage a setup in which we have one dataset with familiar English test items, and a parallel dataset with novel, nonce words\.

For the a\-adjective stimuli, we refine and extend the test items ofBoyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\), to collect a set of 20 test items where a\-adjectives are paired with near synonyms that are not a\-adjectives \(e\.g\.,asleep, sleeping\)\. Both the a\-adjective and the synonym are presented in the pre\-nominal Adjective\-phrase construction and the copular Predicate\-adjective construction\. In the nonce variant of the dataset, the a\-adjectives and synonymous counterparts \(that do not begin with “a”\) are replaced with novel nonce words carefully selected to preserve the a\-adjective property while not invoking real words\. For example, considering the following set of nonce test items which replace the familiar wordsasleep, sleeping:

a\-adjective; pre\-nominal:The asneld cat purred softly on the cushion\. a\-adjective; predicative:The cat that was asneld purred softly on the cushion\. synonym; pre\-nominal:The blelming cat purred softly on the cushion\. synonym; predicative:The cat that was blelming purred softly on the cushion\.

We rely on previous human judgments fromBoyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\)to posit that a\-adjectives are dispreferred in the pre\-nominal position, but acceptable in the Predicate\-adjective position, while other synonymous adjectives are acceptable in either position\. We present the models with the familiar and nonce test items, and measure surprisal for base decoder models and collect forced\-choice responses from other models\. We compare human acceptability judgments to the forced\-choice responses and investigate whether or not perplexity is higher for the test items that would be preempted by people\.

For the Ditransitive/Dative alternation, we synthesize a dataset of 40 test items comprised of familiar, English words and a parallel dataset of 40 test items where nonce words stand in for a target verb\. We present 4 context\-setting sentences prior to the target test sentence that instantiates the familiar or novel verb in the Ditransitive and the Dative constructions\. For example, consider the following set of nonce test items which replace a transfer verb like“sell”and a motion verb like“transport”respectively:

Establishment context for transfer verb:The baker blesped a fresh loaf in exchange for two pound coins this morning… Dative:Sarah blesped the document to her sister\. Ditransitive:Sarah blesped her sister the document\. Establishment context for motion verb:The conveyor belt brulmed the boxes across the warehouse floor… Dative:Sarah brulmed the document to her sister Ditransitive:Sarah brulmed her sister the document

We present both models and human annotators with each test item and compare forced\-choice responses as to whether the Dative or Ditransitive is more natural sounding, and we also compare whether or not higher surprisal aligns with responses that people marked as less natural\. Based on the existing experimental and corpus evidence, if speakers extend preemption patterns to other class members, then we would expect the verbs highlighted by context to be transfer verbs to be acceptable in the Ditransitive, while verbs highlighted by the context to be motion verbs are not acceptable due to preemption of this pattern\.

### 5\.3Models and Measures

We test across the models listed in Table[1](https://arxiv.org/html/2606.02953#S5.T1), providing a cross\-family and cross\-paradigm model of panels for robustness of our results\.

Table 1:Models evaluated\.We use two scoring paradigms\. For the base models \(Llama\-3\.1\-8B base, Qwen3\-32B base\), we take autoregressive surprisal on each candidate sentence conditioned on the establishment context and select the lower\-surprisal option; there is no A/B swap, since surprisal is deterministic per option\. For the instruct models, we present both options as A and B and let the model pick one, counterbalancing AB and BA orderings to absorb slot bias\. The two paradigms read different things off the same item: surprisal reads lexical statistics directly, forced\-choice reads metalinguistic acceptability\. Running both lets us triangulate the construction\-level signal, and any divergence between the two is itself a finding rather than noise\. We test both forced\-choice and surprisal on gpt\-4o\-mini which allows both to check for cross method agreement although gpt\-4o\-mini is instruction tuned, which would normally preclude surprisal\-style measurement\.

We note that surprisal, log probability, and perplexity are closely related transformations of the same underlying quantity\. Surprisal is the negative log probability a model assigns to a token; perplexity is the exponential of the mean surprisal\. “Log probability” and “perplexity” just describe how the surprisal was obtained, not a different measure\.

In these experiments, we adopt the term surprisal\. However we obtain this via perplexity \(surprisal taken from the model run as a language model over the stimulus\) for the two base decoders \(Llama\-3\.1\-8B base and Qwen3\-32B base\)\. We obtain this via log probabilities \(surprisal taken from the model’s answer\-token log probabilities — a soft forced choice\) for the GPT\-4o\-mini\. All other model results are forced choice \(the model’s discrete A/B pick\), not a surprisal measure\.

## 6Results: Coercion

Our coercion experiment compares model responses or surprisal across test items that first provide a context establishing or highlighting a noun’s status as either a container or contents\. The target sentence then forces a choice between a construction coercing a contents reading of something established as a container, or a container reading of something established as a substance/contents\.

The human forced\-choice annotation demonstrates that coercion is readily available in the direction of coercing the container to refer to the contents, as expected given relevant linguistic literature\. Specifically, human annotators prefer this coercion option in 96% of the test items\. Table[2](https://arxiv.org/html/2606.02953#S6.T2)summarizes comparative results across models and human preference\.

Table 2:Results summary for stimuli testing the interpretation of coercive constructions; the final column compares the percentage of Container coercing Contents reading to the percentage of human preference for this choice at 96%\.The largest models approach human\-like preferences: we measure Qwen3\-32B base preference based on per\-token surprisal and find higher surprisal for the Contents coercing Container option, indicating that the Container coercing Contents option is less surprising and more expected 85% of the time\. Smaller models do also show a matching overall preference for the Container coercing Contents option, and notably we see near identical results for GPT\-4o\-mini, where we measure via both forced\-choice and surprisal\. Importantly, our result shows that even when faced with nonce words, models, like people, are able to understand from prior context the semantics of the target lexical item, and furthermore, interpret the coerced semantics such that models reflect a human\-like preference for the Container as Contents coercion in comparison to Contents as Container\. It is worth emphasizing that this requires models to interpret novel lexical items on the fly from context, which in turn requires tapping into commonsense information about what kinds of things have substance\-like properties and which have container\-like properties\. Increased model scale helps models generalize, such that novel instantiations of constructions can be interpreted in a human\-like fashion when faced with coercive, productive extensions\.

## 7Results: Preemption

### 7\.1A\-adjective

We first compare model results against human acceptability judgments fromBoyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\)of familiar and nonce a\-adjectives in either the pre\-nominal Adjective\-phrase construction or the post\-copula Predicate\-adjective construction\.

Table[3](https://arxiv.org/html/2606.02953#S7.T3)summarizes model choice in comparison to human preference \(“B&G humans” referring toBoyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\)\) for familiar, English words\. Human preference is strongly for non\-a\-adjectives in the pre\-nominal Adjective\-phrase position: people prefer the a\-adjective at a rate 0\.200 in comparison to the non\-a\-adjective in the same position at a rate of 0\.780\. This provides a difference or gap of \-0\.580; we use this gap as a measure of the strength of preference\.

When compared to model results, we see that all of the instruction\-tuned models demonstrate the same statistical preemption blocking of the pre\-nominal a\-adjective and preference for the non\-a\-adjective in the pre\-nominal position\.

Only Qwen3\-32B base and Llama\-3\.1\-8B base are scored by surprisalargmin\(i\.e\. pick the lower\-surprisal form per pair\)\. The Predicate\-adjective form has lower surprisal on every pair for both a\-adjectives and non\-a\-adjectives as the relative\-clause frame adds left context before the adjective\. As a result,argminfloors the pre\-nominal Adjective\-phrase preference rate to 0 for a\-adjectives and controls alike, collapsing the gap to 0\. This is a measurement floor, not a failed restriction\. The whole measurement lives in the magnitude of the surprisal difference \(in nats, or bits if preferred\);argminthrows that away by reducing it to a binary pick\. Therefore, an alternative way to score a base decoder is the graded Class × Position interaction, which is clearly positive: a\-adjectives carry a larger attributive penalty than the controls \(Qwen3\-32B \+2\.81 nats, Llama\-8B \+3\.07 nats\)\. This confirms that the base models too mirror the preemptive behaviors of people\.

Overall, our results demonstrate that for familiar English test items, the result of statistical preemption holds for all models: the non\-a\-adjectives are strongly preferred in the pre\-nominal position\.

Table 3:Pre\-nominal position preference rates for*a\-*adjectives vs\. non\-*a\-*adjectives, with the gap \(*a\-*adj−\-non\-*a*\)\. Negative values indicate the expected restriction on*a\-*adjectives in pre\-nominal position\.Table[4](https://arxiv.org/html/2606.02953#S7.T4)summarizes model choice in comparison to humans for the nonce a\-adjectives and nonce non\-a\-adjectives in the pre\-nominal Adjective\-phrase construction position\. Results for nonce adjectives diverge sharply from the familiar adjectives: No model produces a significant gap, and in fact the gaps collapse to roughly zero or hold in the wrong direction \(i\.e\. models show no preference for a\-adjectives or non\-a\-adjectives in either position, or prefer the a\-adjectives\)\. Thus, our findings demonstrate that while models can generalize preemption patterns for familiar words, models cannot deploy this generalization in the face of novel words\.

Table 4:Pre\-nominal position preference rates for*a\-*adjectives vs\. non\-*a\-*adjectives, with the gap \(*a\-*adj−\-non\-*a*\)\. Positive values indicate that the model places*a\-*adjectives in pre\-nominal position*more*often than non\-*a*\-adjectives — the opposite of the human pattern\.We note that a potential confound may be that while humans recognize the a\-adjective character of the nonce adjectives and therefore extend preemption patterns to the novel adjectives, models do not recognize this feature\. We therefore test whether or not models can accurately identify the novel a\-adjectives in comparison to other nonce adjectives and find that models are able to identify the distinct adjective classes near 100% of the time \(full results reported in Appendix[Appendix: Class Identification](https://arxiv.org/html/2606.02953#Sx1)\)\. Every frontier model identifies the class accurately from form\. Thus, the category is recoverable from text: the model knowsasneldpatterns withasleepandblelmingdoes not\. Furthermore, if we explicitly prompt the model with the preemptive rule blocking a\-adjectives in the pre\-nominal position, the models then apply this preemption perfectly in subsequent testing\.

The novel forced\-choice failures are about deployment, not identification: GPT\-5\.4’s collapse \(showing a clear opposite preference for a\-adjectives in the pre\-nominal position\) was not class\-blindness as we know that when asked directly, it classifies the same words correctly\. We conclude that the a\-adjective category is extractable and identifiable to models, but preemption does not follow: the models do not apply the category restriction productively to novel a\-adjectives without being told to\.

### 7\.2Ditransitive & Dative Alternation

Next, we compare model responses on acceptability of familiar transfer or motion verbs in both the Dative and Ditransitive \(or “Double Object” \(DO\)\), as well as nonce verbs, wherein context establishes them as either transfer or motion verbs, in both the Dative and Ditransitive\. We compare model responses to expected acceptability based on corpus studies and linguistic literature, from which we established the hypothesis that when context highlights a verb’s transfer semantics, the Ditransitive is not blocked, whereas when context highlights a verbs motion semantics, the Ditransitive is blocked while the Dative variant is acceptable\.

Results for familiar transfer and motion verbs are given in Table[5](https://arxiv.org/html/2606.02953#S7.T5)\. We note that a random pick, demonstrating no preference for either verb class in either construction, would be 50% pick rate for the Ditransitive/DO construction\. In contrast, the expected preference is for transfer verbs to pattern with the Ditransitive/DO construction \(about 50% in the second reference line of Table[5](https://arxiv.org/html/2606.02953#S7.T5)\) while motion verbs are never preferred in the Ditransitive/DO construction \(0% in the second reference line of Table[5](https://arxiv.org/html/2606.02953#S7.T5)\)\.

Note that in this evaluation, statistical preemption must operate over a semantic class; preemption of purely motion verbs must be extended productively\. Our results show that the effect of statistical preemption is also paralleled in model preference: nearly all models show a clear preference for verbs with transfer semantics in the Ditransitive/DO construction \(selected as more natural between 25 and 50% of the time, depending upon the model\) while generally blocking placing motion verbs in this construction \(all models, except DeepSeek V4\-Flash, only select this option as more natural between 5 and 27\.5% of the time\)\.

TierModelTransfer DO%Motion DO%Class effectReferenceRandom pick \(no preference\)50%50%0ReferenceTarget \(construction working\)∼\\sim50%∼\\sim0%∼\+50\\sim\+50Closed — largeGPT\-5\.437\.5%15\.0%\+22\.5\+22\.5Closed — midGPT\-4o\-mini40\.0%10\.0%\+30\.0\+30\.0Closed — midGPT\-4o\-mini \(surprisal\)40\.0%10\.0%\+30\.0\+30\.0Closed — smallDeepSeek V4\-Flash50\.0%50\.0%\+0\.0\+0\.0Open — large \(∼\\sim32B\)Qwen3\-32B base30\.0%20\.0%\+10\.0\+10\.0Open — small \(7–8B\)Qwen2\.5\-7B\-Instruct32\.5%15\.0%\+17\.5\+17\.5Open — small \(7–8B\)Llama\-3\.1\-8B base25\.0%5\.0%\+20\.0\+20\.0Open — small \(7–8B\)Llama\-3\.1\-8B\-Instruct37\.5%27\.5%\+10\.0\+10\.0Table 5:Double\-object \(DO\) construction preference rates for possession verbs vs\. motion verbs, with the class effect \(transfer DO%−\-motion DO%\)\. The expectation is that transfer verbs lead to a strong preference for DO while motion verbs block DO\.Table[6](https://arxiv.org/html/2606.02953#S7.T6)summarizes model responses when faced with nonce verbs where context frames the nonce verb either as a transfer or motion verb\. Like the results for a\-adjective preemption, we see again that the blocking effect of preemption is lost when the model faces nonce verbs\.

TierModelTransfer DO%Motion DO%Class effectReferenceRandom pick \(no preference\)50%50%0ReferenceTarget \(construction working\)∼\\sim50%∼\\sim0%∼\+50\\sim\+50Closed — largeGPT\-5\.432\.5%20\.0%\+12\.5\+12\.5Closed — midGPT\-4o\-mini35\.0%12\.5%\+22\.5\+22\.5Closed — smallDeepSeek V4\-Flash50\.0%50\.0%\+0\.0\+0\.0Open — large \(∼\\sim32B\)Qwen3\-32B base5\.0%0\.0%\+5\.0\+5\.0Open — small \(7–8B\)Qwen2\.5\-7B\-Instruct30\.0%35\.0%−5\.0\-5\.0Open — small \(7–8B\)Llama\-3\.1\-8B base5\.0%0\.0%\+5\.0\+5\.0Open — small \(7–8B\)Llama\-3\.1\-8B\-Instruct35\.0%32\.5%\+2\.5\+2\.5Table 6:Double\-object construction preference rates for nonce transfer verbs vs\. nonce motion verbs, with the class effect \(transfer DO%−\-motion DO%\)\.We may again ask if this is a failure of preemption or a failure of recognizing the distinct motion and transfer classes of verbs as such\. We therefore test models on accuracy in identifying the familiar and nonce verbs \(where context cues either motion or transfer semantics\) as either motion or transfer verbs\. We report full results in Appendix[Appendix: Class Identification](https://arxiv.org/html/2606.02953#Sx1); we show that all models are able to identify the classes accurately 100% of the time\. The context\-setting sentences establish the semantics of the nonce verbs: the models know exactly which novel verbs are transfer and which are motion\. The failure to demonstrate a preference for novel transfer verbs within DO is therefore a deployment failure, not an identification failure\. The model can classify the verb, it just does not extend the preemption pattern to the novel verbs\. This mirrors our nonce a\-adjective findings exactly\. In both experiments, the model recovers and identifies the relevant category \(a\-adjective class; transfer\-vs\-motion class\) but does not productively deploy the preemption\-based restriction on novel items\. Identification is solved; productive deployment from the structure alone is the gap\.

## 8Related Work

The question of whether LLMs acquire grammatical knowledge beyond memorized surface patterns has been pursued from several directions\. Our experiments sit at the intersection of three\.

A first line of work probes LLMs for sensitivity to syntactic structure using targeted minimal pairs\. This began withLinzenet al\.\([2016](https://arxiv.org/html/2606.02953#bib.bib417)\)on subject\-verb agreement and was substantially extended by the BLiMP benchmarkWarstadtet al\.\([2020](https://arxiv.org/html/2606.02953#bib.bib418)\)\. These studies establish that models track structural dependencies well above chance, but they evaluate grammatical knowledge rather than productivity\. They ask whether models can recognize a violation, not whether models can extend a pattern to novel material under the right contextual pressure\. Our coercion experiment is closer in spirit to productivity probes: a model that has memorizeddrink the bottletells us little, but a model that interpretssip the glempkitas targeting the contents has done generalization work\.

A second line, more directly relevant, asks whether models encode construction\-level knowledge in the sense of Construction Grammar\.Tayyar Madabushiet al\.\([2020](https://arxiv.org/html/2606.02953#bib.bib19)\)probe BERT for sensitivity to the form\-meaning pairings of specific constructions, and later work demonstrates that model sensitivity to a construction depends, in part, upon its level of schematicity \(whether it has fixed, phonological items or is comprised of fully flexible slots\)Bonial and Tayyar Madabushi \([2025](https://arxiv.org/html/2606.02953#bib.bib412)\)\.Weissweileret al\.\([2022](https://arxiv.org/html/2606.02953#bib.bib243),[2023](https://arxiv.org/html/2606.02953#bib.bib293)\)examine the comparative correlative and a handful of other English constructions, finding that models recognize constructional forms but often fail on the corresponding meanings\.Scivettiet al\.\([2025](https://arxiv.org/html/2606.02953#bib.bib419)\)also explore downstream understanding of constructional semantics and show that while models perform impressively on Natural Language Inference tasks involving common argument structure constructions, model performance degrades sharply on syntactically identical, less common constructions\.Mahowald \([2023](https://arxiv.org/html/2606.02953#bib.bib420)\)tests GPT models on thea beautiful five daysconstruction and finds the pattern is reproduced, but with weaker semantic grounding than in humans\. The general pattern these studies converge on is that models recognize constructional form and do less well on the semantics and the conditions on productive extension\. This is consistent with what we find for the identification half of our preemption experiments\. Our results give that pattern a sharper edge\. The gap is not between form and meaning\. It is between recognizing a category and withholding a construction from members of it\.

A third line addresses overgeneralization and indirect negative evidence specifically\. Most of this work focuses on children rather than models and we have already covered much of this research in §[3](https://arxiv.org/html/2606.02953#S3)and §[4](https://arxiv.org/html/2606.02953#S4)\.Boyd and Goldberg \([2011](https://arxiv.org/html/2606.02953#bib.bib413)\),Robenalt and Goldberg \([2015](https://arxiv.org/html/2606.02953#bib.bib423)\), andPerek and Goldberg \([2017](https://arxiv.org/html/2606.02953#bib.bib424)\)jointly establish that preemption and entrenchment shape how human learners retreat from overgeneralization, with the dative alternation and a\-adjectives as the canonical test cases\.

Whether LLMs deploy analogous mechanisms is comparatively underexplored\.Misra and Kim \([2024](https://arxiv.org/html/2606.02953#bib.bib425)\)examine LLM judgments on dative alternation items fromBresnanet al\.\([2007](https://arxiv.org/html/2606.02953#bib.bib426)\)and report that models reproduce adult preferences for familiar verbs, but do not test productive extension to novel verbs\.Wilcoxet al\.\([2023](https://arxiv.org/html/2606.02953#bib.bib427)\)examine surprisal patterns over dative alternations and find graded effects that track corpus frequency more closely than constructional class\. Neither addresses the question we focus on here: whether the same negative\-evidence sensitivity that lets models match human preferences on familiar items extends to novel items whose class membership is recoverable from context\. We find that it does not\. The familiar/nonce contrast is what isolates productive deployment from memorized lexical statistics, and it is precisely on this contrast that prior work has not pushed\.

## 9Conclusions

In this research, we have presented two opposing statistical forces that psycholinguistic and corpus studies have shown to operate in facilitating generalization and productivity in human grammar, as well as constraining productivity and avoiding overgeneralization of particular forms\. We can see the productive generalization operating clearly in the case of coercion, while we see generalization from negative evidence, from what is not said, in the case of statistical preemption\. While usage\-based theories such as CxG have recognized the importance of the role in frequency in entrenchment and preemption, the advent of LLMs now offers an opportunity to explore the measurement of these forces in web\-scale models of natural language\.

Furthermore, we are able to test a battery of models for whether or not frequency plays a similar role in LLM generalization of linguistic structures\. Our first coercion experiment demonstrates that models are able to generalize from familiar linguistic information and commonsense knowledge to correctly interpret novel nouns as referring to either Contents or Containers, and thereafter infer and mirror the human\-like preference for coercing Containers to read as Contents\. The finding is clear: models are able to generalize from positive evidence in reinterpreting novel nouns in coercive constructions\.

Our second set of preemption experiments demonstrate that models can also extend preemptive patterns*but only when presented with familiar English test items*\. Models reflect human\-like preemptive patterns blocking familiar a\-adjectives in the pre\-nominal position of the Adjective\-phrase construction, and blocking familiar verbs with a motion construal in the Ditransitive/Double Object construction\. However, when we test whether or not models can extend these preemptive patterns to previously unseen, novel a\-adjectives and motion/transfer verbs, the preemption fails\.

*We conclude that while models can generalize to a rather remarkable degree when this is based on positive frequency signals, current models do not deploy negative evidence in the way that people do\.*Thus, our research points to a critical generalization gap in models as compared to people\. We emphasize that this deepens our understanding of model successes and failures and enables novel insights for improving model performance, as well as bolstering theoretical understanding of statistical forces in human language\.

## References

- Word formation in generative grammar cambridge\.Linguistic Inquiry, Monograph one, The MIT press\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p3.1)\.
- N\. Bermel and L\. Knittl \(2012a\)Corpus frequency and acceptability judgments: a study of morphosyntactic variants in czech\.Corpus Linguistics and Linguistic Theory8\(2\),pp\. 241–275\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p4.1)\.
- N\. Bermel and L\. Knittl \(2012b\)Morphosyntactic variation and syntactic constructions in czech nominal declension: corpus frequency and native\-speaker judgments\.Russian linguistics36\(1\),pp\. 91–119\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p4.1)\.
- C\. Bonial and H\. Tayyar Madabushi \(2025\)Constructing understanding: on the constructional information encoded in large language models\.Language Resources and Evaluation59\(4\),pp\. 4559–4598\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- J\. K\. Boyd, F\. Ackerman, and M\. Kutas \(2012\)Adult learners use both entrenchment and preemption to infer grammatical constraints\.In2012 IEEE International Conference on Development and Learning and Epigenetic Robotics \(ICDL\),pp\. 1–2\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p5.1)\.
- J\. K\. Boyd and A\. E\. Goldberg \(2011\)Learning what not to say: the role of statistical preemption and categorization in a\-adjective production\.Language87\(1\),pp\. 55–83\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p6.1),[§5\.2](https://arxiv.org/html/2606.02953#S5.SS2.p2.1),[§5\.2](https://arxiv.org/html/2606.02953#S5.SS2.p4.1),[§7\.1](https://arxiv.org/html/2606.02953#S7.SS1.p1.1),[§7\.1](https://arxiv.org/html/2606.02953#S7.SS1.p2.1),[§8](https://arxiv.org/html/2606.02953#S8.p4.1)\.
- J\. Bresnan, A\. Cueni, T\. Nikitina, and R\. H\. Baayen \(2007\)Predicting the dative alternation\.InCognitive foundations of interpretation,pp\. 69–94\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p5.1)\.
- J\. L\. Bybee and D\. Eddington \(2006\)A usage\-based approach to spanish verbs of’becoming’\.Language82\(2\),pp\. 323–355\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p1.1)\.
- J\. Bybee \(2010\)Language, usage and cognition\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1),[§2](https://arxiv.org/html/2606.02953#S2.p1.1)\.
- W\. Croft \(2001\)Radical Construction Grammar: Syntactic Theory in Typological Perspective\.Oxford University Press\(en\)\.External Links:ISBN 978\-0\-19\-829954\-7Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- A\. E\. Goldberg \(1995\)Constructions: A Construction Grammar Approach to Argument Structure\.University of Chicago Press\(en\)\.Note:Google\-Books\-ID: HzmGM0qCKtICExternal Links:ISBN 978\-0\-226\-30086\-3Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- A\. E\. Goldberg \(2011\)Corpus evidence of the viability of statistical preemption\.Cognitive linguistics22\(1\),pp\. 131–153\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p7.1)\.
- A\. E\. Goldberg \(2019\)Explain me this: creativity, competition, and the partial productivity of constructions\.Princeton University Press\.Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p2.1),[§4](https://arxiv.org/html/2606.02953#S4.p7.1)\.
- T\. Hoffmann \(2022\)Construction grammar\.Cambridge Textbooks in Linguistics,Cambridge University Press,Cambridge\.External Links:[Document](https://dx.doi.org/10.1017/9781139004213)Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p2.1)\.
- P\. Kiparsky \(1982\)Lexical morphology and phonology\.Linguistics in the Morning Calm/Hanshin\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p3.1)\.
- G\. Lakoff and M\. Johnson \(1980\)The metaphorical structure of the human conceptual system\.Cognitive science4\(2\),pp\. 195–208\.Cited by:[§3](https://arxiv.org/html/2606.02953#S3.p5.1)\.
- T\. Linzen, E\. Dupoux, and Y\. Goldberg \(2016\)Assessing the ability of lstms to learn syntax\-sensitive dependencies\.Transactions of the Association for Computational Linguistics4,pp\. 521–535\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p2.1)\.
- K\. Mahowald \(2023\)A discerning several thousand judgments: gpt\-3 rates the article\+ adjective\+ numeral\+ noun construction\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,pp\. 265–273\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- L\. A\. Michaelis \(2013\)Sign\-based construction grammar\.InThe Oxford Handbook of Construction Grammar,T\. Hoffmann and G\. Trousdale \(Eds\.\),External Links:ISBN 9780195396683,[Document](https://dx.doi.org/10.1093/oxfordhb/9780195396683.013.0008),[Link](https://doi.org/10.1093/oxfordhb/9780195396683.013.0008),https://academic\.oup\.com/book/0/chapter/293149610/chapter\-ag\-pdf/44513311/book\_34551\_section\_293149610\.ag\.pdfCited by:[§1](https://arxiv.org/html/2606.02953#S1.p3.1)\.
- L\. A\. Michaelis \(2004\)Type shifting in construction grammar: an integrated approach to aspectual coercion\.Cognitive Linguistics15\(1\),pp\. 1–67\.External Links:[Link](https://doi.org/10.1515/cogl.2004.001),[Document](https://dx.doi.org/doi%3A10.1515/cogl.2004.001)Cited by:[§3](https://arxiv.org/html/2606.02953#S3.p2.1),[§3](https://arxiv.org/html/2606.02953#S3.p3.1.1),[§3](https://arxiv.org/html/2606.02953#S3.p4.1)\.
- K\. Misra and N\. Kim \(2024\)Generating novel experimental hypotheses from language models: a case study on cross\-dative generalization\.arXiv preprint arXiv:2408\.05086\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p5.1)\.
- F\. Perek and A\. E\. Goldberg \(2017\)Linguistic generalization on the basis of function and constraints on the basis of statistical preemption\.Cognition168,pp\. 276–293\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p4.1)\.
- J\. Pustejovsky \(1991\)The generative lexicon\.Computational linguistics17\(4\),pp\. 409–441\.Cited by:[§3](https://arxiv.org/html/2606.02953#S3.p1.1)\.
- G\. Radden and Z\. Kövecses \(1999\)Towards a theory of metonymy\.Metonymy in language and thought4,pp\. 17–60\.Cited by:[§3](https://arxiv.org/html/2606.02953#S3.p4.1)\.
- C\. Robenalt and A\. E\. Goldberg \(2015\)Judgment evidence for statistical preemption: it is relatively better to vanish than to disappear a rabbit, but a lifeguard can equally well backstroke or swim children to shore\.Cognitive Linguistics26\(3\),pp\. 467–503\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p4.1)\.
- W\. Scivetti, M\. Torgbi, M\. Shichman, T\. Pellegrin, A\. Blodgett, C\. Bonial, and H\. T\. Madabushi \(2025\)Beyond memorization: assessing semantic generalization in large language models using phrasal constructions\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,pp\. 1184–1201\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- L\. Suttle and A\. E\. Goldberg \(2011\)The partial productivity of constructions as induction\.Linguistics49\(6\),pp\. 1237–1269\.External Links:[Link](https://doi.org/10.1515/ling.2011.035),[Document](https://dx.doi.org/doi%3A10.1515/ling.2011.035)Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p2.1.1)\.
- H\. Tayyar Madabushi, L\. Romain, D\. Divjak, and P\. Milin \(2020\)CxGBERT: BERT meets Construction Grammar\.InProceedings of the 28th International Conference on Computational Linguistics,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),Barcelona, Spain \(Online\),pp\. 4020–4032\.External Links:[Link](https://aclanthology.org/2020.coling-main.355),[Document](https://dx.doi.org/10.18653/v1/2020.coling-main.355)Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- A\. L\. Theakston \(2004\)The role of entrenchment in children’s and adults’ performance on grammaticality judgment tasks\.Cognitive Development19\(1\),pp\. 15–34\.Cited by:[§4](https://arxiv.org/html/2606.02953#S4.p5.1)\.
- M\. Tomasello, D\. Kuhn, and R\. Siegler \(2008\)Acquiring linguistic constructions\.Child and adolescent development,pp\. 263\.Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- M\. Tomasello \(2003\)Constructing a language: a usage\-based theory of language acquisition\.Harvard University Press\(en\)\.External Links:ISBN 978\-0\-674\-01764\-1Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- M\. Tomasello \(2006\)Construction grammar for kids\.Constructions\.Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- M\. Tomasello \(2009\)The usage\-based theory of language acquisition\.InThe Cambridge handbook of child language,pp\. 69–87\.Cited by:[§1](https://arxiv.org/html/2606.02953#S1.p1.1)\.
- A\. Warstadt, A\. Parrish, H\. Liu, A\. Mohananey, W\. Peng, S\. Wang, and S\. R\. Bowman \(2020\)BLiMP: the benchmark of linguistic minimal pairs for english\.Transactions of the Association for Computational Linguistics8,pp\. 377–392\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p2.1)\.
- L\. Weissweiler, V\. Hofmann, A\. Köksal, and H\. Schütze \(2022\)The better your syntax, the better your semantics? probing pretrained language models for the english comparative correlative\.arXiv preprint arXiv:2210\.13181\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- L\. Weissweiler, V\. Hofmann, A\. Köksal, and H\. Schütze \(2023\)Explaining pretrained language models’ understanding of linguistic structures using construction grammar\.Frontiers in Artificial Intelligence6\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p3.1)\.
- E\. Wilcox, C\. Meister, R\. Cotterell, and T\. Pimentel \(2023\)Language model quality correlates with psychometric predictive power in multiple languages\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 7503–7511\.Cited by:[§8](https://arxiv.org/html/2606.02953#S8.p5.1)\.

## Appendix: Class Identification

Table 7:Categorization accuracy on nonce items: rate at which models classified nonce*a\-*adjectives as “restricted” and nonce controls as “ordinary\.”Table[7](https://arxiv.org/html/2606.02953#Sx1.T7)shows models recognize even nonce a\-adjectives as such close to 100% of the time\.

Table 8:Categorization accuracy on nonce verbs: rate at which models classified transfer\-of\-possession nonce verbs as “transfer” and motion nonce verbs as “motion\.”Table[8](https://arxiv.org/html/2606.02953#Sx1.T8)demonstrates that models perfectly \(100% of the time\) classify even nonce verbs as either transfer or motion verbs based on the context\-setting sentences\.

Similar Articles

How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

arXiv cs.CL

This paper introduces a register-aware linguistic evaluation framework to assess how human-like large language models (LLMs) are by comparing the distribution of 67 lexico-grammatical features between human and LLM-generated texts using Maximum Mean Discrepancy. Experiments across seven instruction-tuned open-source models and five registers show that no model perfectly matches human baselines, and closeness to human language varies by register rather than model size.

Language Acquisition Device in Large Language Models

arXiv cs.CL

This paper proposes LAD-inspired pre-pretraining using a formal language called MP-Struct that encodes natural-language-like structures. It shows that this approach improves token efficiency and imparts human-like resistance to structurally implausible languages, challenging prior hypotheses about effective pre-pretraining languages.

On the Persistent Effects of Lexicality in Large Language Mod

arXiv cs.CL

This paper investigates how lexical overlap, rather than semantic content, influences LLM representations across layers and architectures, and demonstrates that this lexical effect persists even in models trained for semantic similarity, leading to degraded performance on downstream tasks.