Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks
Summary
This paper proposes collocational bootstrapping, a mechanism by which statistical word co-occurrence cues can aid the acquisition of English subject-verb agreement, supported by neural network simulations and analysis of child-directed speech.
# Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks
Source: [https://arxiv.org/html/2605.20529](https://arxiv.org/html/2605.20529)
Claire Hobbs \(Cognitive Science Program, Yale University; claire\.hobbs@yale\.edu\) and R\. Thomas McCoy \(Dept\. of Linguistics & Wu Tsai Institute, Yale University; tom\.mccoy@yale\.edu\)
###### Abstract
In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co\-occurrence patterns can provide cues to syntactic dependencies\. We investigate whether this mechanism can support the acquisition of English subject\-verb agreement\. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject\-verb pairings are\. We find that there is a range of variability levels at which these statistical learners robustly learn subject\-verb agreement\. We then analyze the variability of subject\-verb pairings in child\-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations\. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive\.
## 1 Introduction
The sentences that we encounter do not come annotated with explicit syntax trees\. How, then, do children acquire a language’s syntax? Some proposals postulate innate predispositions that might guide children toward particular structural analyses \(e\.g\., Chomsky, [1965](https://arxiv.org/html/2605.20529#bib.bib49)\)\. Other proposals point to ways in which non\-syntactic aspects of the data, such as semantics Wexler and Culicover \([1980](https://arxiv.org/html/2605.20529#bib.bib35)\) or prosody Morgan and Demuth \([1996](https://arxiv.org/html/2605.20529#bib.bib24)\), might provide helpful cues from which the learner can “bootstrap” syntactic structure\. For instance, prosodic boundaries tend to coincide with syntactic boundaries such that observable prosody might point toward accurate analyses of unobservable syntax\.
Recent advances in artificial intelligence have given prominence to one particular type of data\-driven cue: statistical properties of linguistic strings\. Neural network models trained to capture the statistical properties of corpora perform well on tests that target syntactic phenomena such as filler\-gap dependencies Wilcox et al\. \([2024](https://arxiv.org/html/2605.20529#bib.bib36)\), negative polarity items Jumelet and Hupkes \([2018](https://arxiv.org/html/2605.20529#bib.bib10)\), and subject\-auxiliary inversion Mueller et al\. \([2022](https://arxiv.org/html/2605.20529#bib.bib25)\)\. These systems do not have explicit syntactic predispositions built into them, so their strong syntactic abilities suggest that naturalistic text possesses statistical cues from which much of syntax can be inferred\. This evidence *that* statistical properties can pave the way to syntax raises the question of *how* they might do so\.
In this work, we propose *collocational bootstrapping* as a hypothesis for one specific way in which statistical cues could contribute to syntactic acquisition\. Under this hypothesis, syntactic structure can be inferred from trends regarding which words frequently co\-occur\. As a case study, we consider English subject\-verb agreement, the phenomenon in which a verb must have the same grammatical number as its subject \(e\.g\., *the dogs bark* is grammatical, but *the dogs barks* is not\)\. One challenge for acquiring subject\-verb agreement is that there are \(at least\) two potential rules that could explain most examples of this phenomenon:
- Agree\-Subject: A verb should agree with its subject\.
- Agree\-Recent: A verb should agree with the closest preceding noun\.
We can tell that Agree\-Subject is the correct rule by considering sentences where these rules make different predictions; e\.g\., when choosing a verb for the sentence *the dogs in the park \[bark/barks\]*, Agree\-Subject would correctly choose *bark* while Agree\-Recent would incorrectly choose *barks*\. However, for most naturally\-occurring sentences, a verb’s subject is also the most recent noun, meaning that it may be challenging for learners to identify which rule is correct\.
The proposed mechanism of collocational bootstrapping \(described in more detail in Section [3](https://arxiv.org/html/2605.20529#S3)\) provides one way to select between these two rules even in the absence of direct disambiguating examples such as *the dogs in the park bark*\. Under the collocational bootstrapping hypothesis, learners leverage information about word co\-occurrence as a window into syntactic dependencies\. E\.g\., given *the dog on the couch barks*, a learner could infer that there is more likely to be a dependency between *dog* and *barks* than between *couch* and *barks* because *dog* is more likely to co\-occur with *barks* than *couch* is\. After inferring many such potential dependencies, the learner then abstracts away from the specific words that are involved to recognize the abstract syntactic configurations that are truly at the heart of the dependency\.
To investigate whether collocational bootstrapping is a viable strategy, we trained multiple neural network language models on synthetically\-generated datasets\. Because collocational bootstrapping depends on associations between subjects and verbs, we varied the extent to which a subject could be predicted from its verb\. Specifically, subjects were sampled from Zipfian distributions Zipf \([1949](https://arxiv.org/html/2605.20529#bib.bib66)\) where the probability of a verb’s $r^{\text{th}}$ most frequent subject is proportional to $1/r^{\alpha}$; varying the parameter $\alpha$ modulates how predictable the subject is given the verb\. Critically, the models’ training sets were constrained to be fully ambiguous between Agree\-Subject and Agree\-Recent \(e\.g\., using sentences like *the dog in the park barks*\), but the systems were then evaluated on sentences that disambiguated these rules\.
We find that subject\-verb co\-occurrence statistics have a substantial effect on how well the models learn subject\-verb agreement; there are some statistical settings \(namely, when $\alpha \approx 1.4$, yielding moderate variability\) where the models successfully learn Agree\-Subject and others where they do not \(namely, when $\alpha$ is very low, producing highly variable data, or very high, producing highly predictable data\)\. The fact that certain statistical configurations support effective generalization supports the collocational bootstrapping hypothesis\. Given that collocational bootstrapping is only effective in certain statistical settings, we next perform a corpus analysis of a dataset of child\-directed language to see whether children’s input has the properties that supported success in our simulations\. We find preliminary evidence that child\-directed language indeed has the requisite properties\.
Overall, our neural\-network experiments provide a proof of concept showing that collocational bootstrapping can guide a learner to accurate syntactic analyses, and our corpus analysis suggests that children’s input has the statistical properties that make collocational bootstrapping effective\. This work is a step toward understanding how quantitative aspects of a learner’s input can support the learning of abstract, qualitative syntactic phenomena\. Our code is available on GitHub: [https://github\.com/ClaireHobbs/collocational\-bootstrapping](https://github.com/ClaireHobbs/collocational-bootstrapping)\.
## 2 Background and Related Work
#### Bootstrapping in language acquisition:
Several mechanisms have been proposed by which learners might infer aspects of syntax from non\-syntactic information such as prosody Morgan and Demuth \([1996](https://arxiv.org/html/2605.20529#bib.bib24)\) or meaning Wexler and Culicover \([1980](https://arxiv.org/html/2605.20529#bib.bib35)\); Pinker \([1984](https://arxiv.org/html/2605.20529#bib.bib32)\); Abend et al\. \([2017](https://arxiv.org/html/2605.20529#bib.bib1)\); Yedetore and Kim \([2024](https://arxiv.org/html/2605.20529#bib.bib41)\)\. The most relevant prior proposal is distributional bootstrapping, in which syntactic categories can be inferred from distributional properties: e\.g\., words occurring in similar contexts likely belong to the same part of speech Maratsos and Chalkley \([1980](https://arxiv.org/html/2605.20529#bib.bib16)\); Finch and Chater \([1992](https://arxiv.org/html/2605.20529#bib.bib6)\); Mintz \([2003](https://arxiv.org/html/2605.20529#bib.bib20)\)\. Like distributional bootstrapping, collocational bootstrapping leverages distributional properties of words, but it is a strategy for acquiring relationships between words rather than word categories\. Another proposal that is potentially related to collocational bootstrapping is semantic bootstrapping Wexler and Culicover \([1980](https://arxiv.org/html/2605.20529#bib.bib35)\); see Section [6](https://arxiv.org/html/2605.20529#S6) for discussion\. The various types of bootstrapping are not mutually exclusive; children might use many or all of them\.
#### Subject\-verb agreement in neural networks:
A substantial body of work has investigated whether neural networks can learn English subject\-verb agreement Elman \([1991](https://arxiv.org/html/2605.20529#bib.bib5)\); Linzen et al\. \([2016](https://arxiv.org/html/2605.20529#bib.bib56)\)\. Such networks have been found to be capable of robustly learning subject\-verb agreement from naturalistic text Kuncoro et al\. \([2018](https://arxiv.org/html/2605.20529#bib.bib12)\); Gulordava et al\. \([2018](https://arxiv.org/html/2605.20529#bib.bib51)\); Goldberg \([2019](https://arxiv.org/html/2605.20529#bib.bib50)\); Wei et al\. \([2021](https://arxiv.org/html/2605.20529#bib.bib65)\)\. In this work, we train networks on controlled synthetic data to analyze what statistical properties of corpora might be supporting such learning\. Our approach shares with Wei et al\. \([2021](https://arxiv.org/html/2605.20529#bib.bib65)\) the strategy of training neural networks on corpora that vary in controlled ways, but we investigate a different factor \(namely the predictability of the subject given the verb, as opposed to word frequency\)\.
#### Distributional cues to syntax:
Both the acquisition literature and the computational literature have discussed what properties of a learner’s input might support the acquisition of syntax\. Properties discussed include the presence of sentences that might directly disambiguate competing hypotheses Pullum and Scholz \([2002](https://arxiv.org/html/2605.20529#bib.bib33)\); Mulligan et al\. \([2021](https://arxiv.org/html/2605.20529#bib.bib26)\), the presence of one phenomenon that might be helpful for acquiring different phenomena Pearl and Mis \([2016](https://arxiv.org/html/2605.20529#bib.bib31)\); Patil et al\. \([2024](https://arxiv.org/html/2605.20529#bib.bib30)\); Misra and Mahowald \([2024](https://arxiv.org/html/2605.20529#bib.bib21)\); Yang et al\. \([2026a](https://arxiv.org/html/2605.20529#bib.bib40)\), the semantic features of a word’s arguments Misra and Kim \([2024](https://arxiv.org/html/2605.20529#bib.bib22)\), the statistical properties of function words Yang et al\. \([2026b](https://arxiv.org/html/2605.20529#bib.bib39)\), the frequencies of particular words Wei et al\. \([2021](https://arxiv.org/html/2605.20529#bib.bib65)\); Leong and Linzen \([2026](https://arxiv.org/html/2605.20529#bib.bib14)\), the diversity and complexity of observed syntactic structures Qin et al\. \([2025](https://arxiv.org/html/2605.20529#bib.bib34)\), and the frequencies of syntactic configurations Wonnacott et al\. \([2008](https://arxiv.org/html/2605.20529#bib.bib37)\); Yang \([2016](https://arxiv.org/html/2605.20529#bib.bib38)\)\. We instead study the distributional feature of variability in word co\-occurrence statistics\.
#### The role of variability in learning:
Across domains of cognition, the tradeoff between predictability and variability is a central tension for learning Raviv et al\. \([2022](https://arxiv.org/html/2605.20529#bib.bib60)\): predictable input can support faster learning, while variable input supports more abstract generalizations\. Our work applies this idea to the learning of English subject\-verb agreement by investigating variability in subject\-verb pairings\. The most relevant prior papers are Gómez \([2002](https://arxiv.org/html/2605.20529#bib.bib8)\) and Onnis et al\. \([2004](https://arxiv.org/html/2605.20529#bib.bib28)\)\. In both, human participants were shown strings of the form *aXb*, where the first and third elements had an agreement dependency \(akin to subject\-verb agreement\), and the *X* elements were arbitrary\. These papers found that people can learn such patterns more readily when the set of possible *X* elements is larger, showing that variability in intervening elements can support learning of nonadjacent syntactic dependencies\.
Our work differs from these papers in that we investigate variability in subject\-verb pairings, rather than variability in the intervening material\. Further, we test artificial neural networks rather than humans\. Finally, while these prior papers achieved greater variability by increasing the number of possible *X* entities, we held the relevant sets constant and varied only the frequencies of their elements\. This choice means that our conditions differ only quantitatively; there are no qualitative differences regarding which pairings are present \(except in one case, the $\alpha \to \infty$ case\)\.
## 3 The Collocational Bootstrapping Hypothesis
Words are not distributed uniformly in natural language use\. Most relevantly for this paper, a given verb is much more likely to have some subjects than others due to the meanings that people are likely to express \(e\.g\., *the dog barked* is much more likely than *the potato barked*\)\. We hypothesize that co\-occurrence properties are likely to be especially systematic for words that share a syntactic dependency, such that learners could leverage co\-occurrence information to help learn aspects of syntax, such as subject\-verb agreement\.
As discussed above, there is a tension between predictability and variability\. If a given verb always had the same subject, it would be easy for the learner to recognize which noun the verb is paired with such that Agree\-Subject can be selected over Agree\-Recent\. However, the learner in this setting might simply memorize the few subject\-verb pairings it has seen, thereby failing to generalize to novel pairings\. At the other extreme, if subjects are sampled uniformly, this high degree of variability should support generalization to novel subject\-verb pairings, but it would not provide any systematic co\-occurrence information that would help point to which noun is the verb’s subject\.
Given this tension, the key question underlying our first experiment is whether there exists a level of variability that supports correct generalization of subject\-verb agreement: a level that is predictable enough to make subject\-verb associations apparent yet variable enough to support generalization to novel subject\-verb pairs\. To get at this question, we use simulations with neural networks trained on simple, synthetic grammars\. Using synthetic grammars enables us to fully control and understand which cues are available to the learners so that we can isolate the statistical factors we have highlighted, in the same spirit as other connectionist work that similarly analyzes how neural networks generalize in simple, controlled settings Elman \([1990](https://arxiv.org/html/2605.20529#bib.bib4), [1991](https://arxiv.org/html/2605.20529#bib.bib5)\); Frank et al\. \([2013](https://arxiv.org/html/2605.20529#bib.bib7)\); McCoy et al\. \([2020](https://arxiv.org/html/2605.20529#bib.bib18)\)\.
Table 1: Templates used in training set generation\. Det = determiner, N = noun, PP = prepositional phrase, V = verb\.
## 4 Experiment 1: Neural Networks
Neural networks allow us to simulate language acquisition in a statistical learner that lacks explicit predispositions for specific syntactic structures\. To study the effect of certain statistical properties on learnability, we can train a neural language model on synthetic datasets in which we vary these properties in controlled ways\. In our case, to modulate the level of variability in subject\-verb pairings, we sample pairings from Zipfian distributions \(defined by the equation below\) that vary the parameter $\alpha$; note that $\alpha$ is a free parameter while $K$ is a normalizing constant whose value is fully determined by the need for the set of $f(r)$ values to sum to 1:

$$f(r) = \frac{K}{r^{\alpha}} \tag{1}$$

Our $\alpha$ values ranged from 0 to 3, with lower $\alpha$ values producing highly variable pairings and higher $\alpha$ values producing predictable pairings, and we included an $\alpha \to \infty$ scenario in which each subject was seen paired with only one verb\. After training, we evaluated the model’s ability to generalize subject\-verb agreement beyond its training data\.
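To make Equation 1 concrete, the following minimal sketch computes the normalized probabilities for a few values of $\alpha$ and reports how concentrated each distribution is\. The function names and the choice of 30 ranks \(the number of subjects permitted per verb in Section 4\.1\) are illustrative assumptions, not taken from the released code\.

```python
import math

def zipf_probs(alpha: float, n_ranks: int) -> list[float]:
    """Return f(r) = K / r**alpha for r = 1..n_ranks, normalized to sum to 1."""
    weights = [1.0 / (r ** alpha) for r in range(1, n_ranks + 1)]
    k = 1.0 / sum(weights)  # normalizing constant K from Equation 1
    return [k * w for w in weights]

def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits; higher means more variable pairings."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative alpha values: uniform, moderate, and highly predictable.
for alpha in (0.0, 1.4, 3.0):
    p = zipf_probs(alpha, n_ranks=30)
    print(f"alpha={alpha}: top-rank prob={p[0]:.3f}, "
          f"entropy={entropy_bits(p):.2f} bits")
```

At $\alpha = 0$ every rank receives the same probability, and as $\alpha$ grows the mass shifts toward the top\-ranked subject, so the entropy falls\.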
This highly simplified setup creates a proof\-of\-concept test to determine whether there exist situations in which collocational bootstrapping would be an effective strategy for a statistical learner\. Specifically, there is no guarantee that collocational bootstrapping can ever succeed because it may be that all training sets are either too predictable to support abstract generalization or too variable to provide a clear statistical signal \(see Section [3](https://arxiv.org/html/2605.20529#S3)\)\. By modulating $\alpha$, we investigate multiple levels of variability to see if there exist levels that resolve this tension between predictability and variability, such that conditions exist under which collocational bootstrapping can succeed\.
Table 2: Example minimal pairs for a moderately variable condition \($\alpha = 1.5$\)\. Underlining indicates that the noun has not been seen in training data paired with this verb\.

### 4\.1 Data
We created synthetic datasets containing 12,000 unique grammatically\-correct sentences for every $\alpha$ value tested\. Each sentence contained a subject \(made of a determiner and a noun\) and an intransitive verb, and it could optionally include a prepositional phrase before and/or after the subject\. This setup produced four sentence templates \(Table [1](https://arxiv.org/html/2605.20529#S3.T1)\)\. Within each sentence, all nouns had the same number: Either all were singular, or all were plural\. This meant that the training set was always ambiguous between Agree\-Subject and Agree\-Recent as well as many other potential rules \(e\.g\., Agree\-First, in which a verb agrees with the first noun in the sentence\)\. We enforced this ambiguity to isolate the hypothesized effect of collocational bootstrapping: Are co\-occurrence patterns sufficient to disambiguate candidate agreement rules even in the absence of sentences that would directly disambiguate these rules? See Section [6](https://arxiv.org/html/2605.20529#S6) for discussion of future directions that relax this strict ambiguity\.
Our vocabulary included 40 nouns and 40 present\-tense verbs \(each of which could be singular or plural, creating a total of 80 noun forms and 80 verb forms\)\. For reasons described at the end of this section, each noun in the vocabulary is derived from one of the verbs, creating noun/verb pairs such as *orator*/*orates*\. We chose *the* as the only determiner, and we used *by* and *near* as our prepositions\.
To explain how sentences were sampled, we first give each noun stem and verb stem a numerical index from 0 to 39\. To generate a sentence, first the verb was sampled uniformly from among the 80 options; call its index $i$\. A subject was then sampled from a truncated Zipfian distribution with parameter $\alpha$ that assigns the following unnormalized probability to each of the 40 nouns that can agree with verb $i$, denoting a given noun’s index as $j$: $1/\big(((j - i) \bmod 40) + 1\big)^{\alpha}$ if $(j - i) \bmod 40 < 30$, else $0$\. That is, for each verb, there are 10 nouns that are withheld from appearing as the verb’s subject so that we can evaluate how the models generalize to novel subject\-verb pairs\. If the sentence contained prepositional phrases, the prepositional object nouns were sampled uniformly from among the nouns with indices $i$ to $(i + 29) \bmod 40$ that have the same number as the subject\. One effect of this setup is that the same 10 nouns that were withheld from appearing as the verb’s subject were also withheld from appearing as prepositional objects for purposes of evaluation\.
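The sampling procedure above can be sketched as follows\. This is an illustrative reading of the description rather than the authors' released code: all names are hypothetical, and sampling the verb stem and the grammatical number independently is an assumption that is equivalent to sampling uniformly over the 80 verb forms\.

```python
import random

N_STEMS = 40    # noun and verb stems are indexed 0..39
N_ALLOWED = 30  # offsets 0..29 may co-occur with a verb; 30..39 are withheld

def subject_weights(verb_idx: int, alpha: float) -> list[float]:
    """Unnormalized weight of each noun index j as the subject of verb i."""
    weights = []
    for j in range(N_STEMS):
        offset = (j - verb_idx) % N_STEMS  # Python's % is non-negative here
        if offset < N_ALLOWED:
            weights.append(1.0 / (offset + 1) ** alpha)
        else:
            weights.append(0.0)  # withheld pairing, reserved for evaluation
    return weights

def sample_indices(alpha: float, rng: random.Random, n_pp: int = 0):
    """Sample (verb index, number, subject index, prepositional object indices)."""
    verb_idx = rng.randrange(N_STEMS)
    number = rng.choice(["sg", "pl"])  # shared by every noun in the sentence
    subj_idx = rng.choices(range(N_STEMS),
                           weights=subject_weights(verb_idx, alpha))[0]
    # Prepositional objects: uniform over indices i .. (i + 29) mod 40.
    pp_idxs = [(verb_idx + rng.randrange(N_ALLOWED)) % N_STEMS
               for _ in range(n_pp)]
    return verb_idx, number, subj_idx, pp_idxs

rng = random.Random(0)
print(sample_indices(alpha=1.4, rng=rng, n_pp=2))
```

Because the weight is largest at offset 0, the most likely subject for verb $i$ is the noun with the same index, which matches the morphologically related noun/verb pairs described in this section\.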
By controlling $\alpha$, we can adjust the level of variability in the training data with respect to which subjects appeared with which verbs\. When $\alpha = 0$, the subjects were distributed uniformly, creating a dataset with the highest level of variability\. At the other extreme, when $\alpha \to \infty$, each verb \(e\.g\., *orate*\) had only one noun that ever appeared as its subject: specifically, the noun that was morphologically related to it \(e\.g\., *orator*\)\. For in\-between values of $\alpha$, the nouns available for verb pairings were spread across a truncated Zipfian distribution as defined above, with the most likely subject being the one that is morphologically related to the verb, and other nouns having probabilities that decrease following Equation [1](https://arxiv.org/html/2605.20529#S4.E1)\. We used Zipfian distributions because many linguistic units have been empirically shown to follow them, including in child\-directed language Lavi\-Rotbain and Arnon \([2023](https://arxiv.org/html/2605.20529#bib.bib55)\); see Section [5\.3](https://arxiv.org/html/2605.20529#S5.SS3) for evidence that subject\-verb pairings follow Zipfian distributions in child\-directed language\. Figure [1](https://arxiv.org/html/2605.20529#S4.F1) shows the distribution of nouns at each selected $\alpha$ value, and sample training sentences can be found in Appendix [A](https://arxiv.org/html/2605.20529#A1)\.
Note that our models do not have access to the spellings of words; they are presented with words as atomic tokens\. Thus, they cannot leverage the morphological cues of noun inflection, verb inflection, and the relatedness of certain nouns and verbs \(e\.g\., *painter* and *paint*\); these morphological properties are included in the dataset to make the sentences easier for humans to reason about, but these properties play no role in the models’ learning\. Additionally, our dataset makes many simplifying assumptions compared to natural language use; see Section [6](https://arxiv.org/html/2605.20529#S6) for discussion\.
Figure 1: Noun probability distributions across $\alpha$ values \(log scale\)\. Lower $\alpha$ values produce flatter distributions with more uniform noun usage, while higher $\alpha$ values concentrate probability on fewer nouns\. The distributions were truncated at the dotted line to leave some nouns unseen as the subjects of particular verbs\.
### 4\.2 Models
We trained and evaluated 2\-layer decoder\-only Transformer language models Vaswani et al\. \([2017](https://arxiv.org/html/2605.20529#bib.bib64)\) in the style of GPT\-2 Radford et al\. \([2019](https://arxiv.org/html/2605.20529#bib.bib59)\), adapted from the nanoGPT implementation Karpathy \([2023](https://arxiv.org/html/2605.20529#bib.bib53)\), which enables lightweight, research\-oriented versions of GPT\-2 to be trained from scratch\. Our models used two transformer layers, each with four attention heads, an embedding size of 256, and approximately 1\.6 million parameters\. All code was developed using PyTorch\.
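As a rough sanity check on the reported model size, the sketch below estimates the weight count of a GPT\-2\-style model from the stated hyperparameters\. It assumes a standard block layout \(four attention projection matrices plus an MLP with 4x expansion\) and ignores biases and layer\-norm parameters; `vocab_size` and `block_size` are hypothetical placeholders, since the text does not state them\.

```python
def approx_transformer_params(n_layer: int, d_model: int,
                              vocab_size: int, block_size: int,
                              mlp_ratio: int = 4) -> tuple[int, int]:
    """Approximate weight counts (biases and layer norms ignored)."""
    attn = 4 * d_model * d_model             # query, key, value, output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    blocks = n_layer * (attn + mlp)
    embeddings = (vocab_size + block_size) * d_model  # token + position tables
    return blocks, embeddings

# n_layer and d_model come from the text; the other two values are guesses.
blocks, embeddings = approx_transformer_params(
    n_layer=2, d_model=256, vocab_size=170, block_size=32)
print(f"blocks: {blocks:,}  embeddings: {embeddings:,}  "
      f"total: {blocks + embeddings:,}")
```

Under these assumptions the two transformer blocks alone account for 1,572,864 weights, which is consistent with the reported figure of approximately 1\.6 million; the number of attention heads does not change this count\.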
### 4\.3 Training
For each $\alpha$ value from 0\.0 to 3\.0 inclusive in increments of 0\.1, as well as the case where $\alpha \to \infty$, we did 10 training runs with different random weight initializations\. For each run, we generated a new set of 12,000 unique sentences and used a split of 80% train / 10% validation / 10% test\. We used AdamW Kingma and Ba \([2015](https://arxiv.org/html/2605.20529#bib.bib11)\); Loshchilov and Hutter \([2019](https://arxiv.org/html/2605.20529#bib.bib15)\) with a learning rate of 0\.0006 which remained fixed throughout training\. The batch size was 32, and each training run used 300 batches per epoch for 4 epochs \(1,200 total iterations\)\. The validation loss was computed every 300 steps, and the model version with the best validation loss was saved\. Training and validation losses tracked each other closely \(see Figure [6](https://arxiv.org/html/2605.20529#A4.F6) in the Appendix\), indicating that overfitting was not a concern\.
Figure 2: Model accuracy vs\. Zipfian parameter $\alpha$ across four evaluation conditions\. There is an optimal point where $\alpha = 1.4$ at which models perform robustly in all test conditions\. Error bars show one standard deviation\.
### 4\.4 Evaluation and Results
To evaluate model performance, we generated four sets of 1,000 minimal pair sentences Marvin and Linzen \([2018](https://arxiv.org/html/2605.20529#bib.bib17)\), each targeting a different testing condition\. In each pair, the first sentence was grammatical, and the second was ungrammatical due to the verb not matching the subject’s number\. Sample minimal pairs are in Table [2](https://arxiv.org/html/2605.20529#S4.T2)\.
We assessed each model’s preferences by calculating the log probability it assigned to each sentence in a pair\. For each pair, we considered the model to be correct if it assigned a higher log probability to the grammatical sentence than to the ungrammatical one, and we then computed the overall accuracy across the 1,000 pairs in each set\.
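The scoring rule described above can be sketched as follows\. The `log_prob` argument stands in for whatever function returns a trained model's sentence log probability; here a toy scorer that follows Agree\-Recent is substituted so that the example runs on its own\. Both the scorer and the example sentences are hypothetical and are not drawn from the paper's data\.

```python
from typing import Callable, Iterable, Tuple

def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]],
                          log_prob: Callable[[str], float]) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the strictly higher log probability."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)

def toy_agree_recent_log_prob(sentence: str) -> float:
    """Toy stand-in for a model: prefers verbs that agree with the closest
    preceding noun. Plural nouns and singular verbs are assumed to end in 's'."""
    tokens = sentence.split()
    verb = tokens[-1]
    nouns = [tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == "the"]
    closest_is_plural = nouns[-1].endswith("s")
    verb_is_plural = not verb.endswith("s")
    return 0.0 if closest_is_plural == verb_is_plural else -1.0

pairs = [
    # Matched attractor: Agree-Recent picks the grammatical sentence.
    ("the dogs near the cats bark", "the dogs near the cats barks"),
    # Mismatched attractor: Agree-Recent picks the ungrammatical sentence.
    ("the dogs near the cat bark", "the dogs near the cat barks"),
]
print(minimal_pair_accuracy(pairs, toy_agree_recent_log_prob))
```

A learner that follows Agree\-Recent gets the matched pair right and the mismatched pair wrong, which is why mismatched attractors are needed to tell the candidate rules apart\.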
The four sets varied in difficulty according to whether the subject\-verb pairings and prepositional objects had appeared in the model’s training data \(SEEN\) or not \(UNSEEN\), and whether the grammatical number of prepositional object nouns matched that of the subject \(MATCH\) or not \(MISMATCH\)\. The number mismatches served as attractors to assess whether the model had learned Agree\-Subject \(which would identify the correct noun for the verb to agree with\) or an incorrect strategy such as Agree\-Recent or Agree\-First \(both of which would select the incorrect verb inflection\)\. All minimal pairs shared a uniform syntactic structure, \[PP Det N PP V\], presenting the model with three competing nouns as possible agreement targets for the verb\. Below, we define these four conditions in detail and present results for each\.
#### SEEN, MATCH:
The sentences in this condition used subject\-verb pairings the model encountered during training, with prepositional objects matching the subject’s number\. This condition presented the lowest difficulty for the models, serving primarily to verify that the models had successfully learned the patterns present in the training data\. Across all $\alpha$ values, the models achieved 100% or near 100% accuracy as shown by the blue line in Figure [2](https://arxiv.org/html/2605.20529#S4.F2)\.
#### UNSEEN, MATCH:
Here, we introduce a source of lexical difficulty\. For a given verb, the subject and prepositional objects in the test sentences were ones that had never appeared in the same sentence as that verb during training\. As in the previous condition, the prepositional objects matched the subject’s number\. Success in this condition requires the model to generalize across words of the same number: that is, to use distributional commonalities to form the classes of *singular nouns* and *plural nouns*, and to recognize that the same verb form applies to any member of that class\. This type of generalization should be easiest when $\alpha$ is low, meaning that all the singular nouns have similar distributions and are therefore easier for the model to group together into a cohesive class, and similarly for the plurals\. As expected, accuracy was high for low $\alpha$ values but began to decline when $\alpha \approx 1.4$ \(see the red line in Figure [2](https://arxiv.org/html/2605.20529#S4.F2)\)\. At high $\alpha$ values, the models appear to learn strong associations between specific subjects and verbs, preventing them from generalizing well to unseen ones\.
#### SEEN, MISMATCH:
This condition presents a different type of difficulty: conflicting cues about number agreement\. If the model has incorrectly learned that the verb should agree either with the closest noun or with the first noun in the sentence \(both of which would succeed for all sentences in the training set\), it will now fail when tested with sentences in which the prepositional objects have a different grammatical number than the subject\. High $\alpha$ values in the training data create greater predictability in subject\-verb pairings, which we hypothesize would help models select Agree\-Subject over other competing rules by making it easier to recognize which syntactic positions host the noun that the verb shares a syntactic dependency with\. As shown by the yellow line in Figure [2](https://arxiv.org/html/2605.20529#S4.F2), results confirmed that the models perform poorly at low $\alpha$ values but reach high accuracy \(approaching 100%\) as $\alpha$ increases\.
#### UNSEEN, MISMATCH:
Our final condition combines both sources of difficulty: novel subject\-verb pairings and prepositional objects that have a different number from the subject \(note that the prepositional object\-verb pairings are also novel, as in the UNSEEN, MATCH condition\)\. As above, we predict that low $\alpha$ values will prevent the model from generalizing because it will struggle to identify the correct agreement target among mismatching competitors, and that high $\alpha$ values will also cause the model to perform poorly, as it will struggle to generalize to unseen noun/verb pairings\. What happens between the low and high $\alpha$ values is harder to predict\. The critical question is whether there exists a “sweet spot” between extremes, where the model can handle both the UNSEEN and MISMATCH aspects of this condition\. We find that there is indeed such a sweet spot \(Figure [2](https://arxiv.org/html/2605.20529#S4.F2), purple line\): The model showed poor performance at low and high $\alpha$ values, but there is a peak with near\-perfect accuracy at intermediate values\.
### 4\.5 Discussion
The results show there is an ideal level of variability in subject\-verb pairings in the training data that helps the model generalize robustly\. Too much variation hinders the model from inferring the correct syntactic structure\. Too much predictability prevents the model from forming an abstract rule, such that it generalizes poorly to novel subject\-verb pairings\. Between these extremes, when $\alpha \approx 1.4$, there is an optimal level of variability that supports robust generalization\. This pattern demonstrates two key points\. First, the fact that model performance varies with the level of variability indicates that these neural networks indeed use co\-occurrence statistics to inform the learning of subject\-verb agreement in the ways expected under the collocational bootstrapping hypothesis\. Second, the existence of an optimal level at which we get robust generalization shows that, under the right conditions, collocational bootstrapping can be a viable learning strategy\.
## 5Experiment 2: Analysis of CHILDES
In the previous experiment, we observed that when $\alpha \approx 1.4$, the synthetic training data contained a level of variability that optimizes generalization. We now investigate whether a similar statistical signal is present in real-world data such that children could potentially leverage this signal to assist in learning subject-verb agreement. Toward this end, we consider the frequency of subject-verb pairings in a corpus of child-directed language.
### 5\.1Data
CHILDES, or the Child Language Data Exchange System, is an open repository of transcripts and other supporting media containing conversations between children and their caretakers, which have been compiled and donated by researchers MacWhinney ([2000](https://arxiv.org/html/2605.20529#bib.bib57)). For this experiment, we used data from CHILDES participants tagged as English-language speakers. Because our goal is to understand the linguistic input that a child might receive (and not the utterances that the child produces), we filtered the speakers to include only those speaking to children but not the children themselves.
We extracted all utterances spoken by an adult to children with ages from 0 to 96 months, resulting in a set of 4,739,189 utterances; see Appendix[E](https://arxiv.org/html/2605.20529#A5)for more details about data filtering and cleaning\.
We parsed the utterances using the spaCy dependency parser Honnibal et al. ([2020](https://arxiv.org/html/2605.20529#bib.bib52)). We extracted all pairings of a subject noun and the corresponding verb (the pairings characterized by the dependency type `nsubj`), using lemmas for both subjects and verbs. This produced 2,802,071 subject-verb pairs.
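The extraction step can be sketched as follows. This is a simplified illustration rather than the pipeline used for the paper: it assumes the utterances have already been parsed, and it represents each token with a hypothetical `Token` record (lemma, dependency label, index of the head) instead of calling a parser. The lowercasing follows the description in Appendix E; the example utterances and their parses are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    lemma: str  # lemmatized form of the word
    dep: str    # dependency label linking this token to its head
    head: int   # index of the head token within the same utterance


def subject_verb_pairs(utterance):
    """Yield (subject lemma, head lemma) for every nsubj arc in one parsed utterance."""
    for tok in utterance:
        if tok.dep == "nsubj":
            yield tok.lemma.lower(), utterance[tok.head].lemma.lower()


# Hand-written toy parses (assumptions for illustration only).
parsed = [
    [Token("the", "det", 1), Token("dog", "nsubj", 2), Token("bark", "ROOT", 2)],
    [Token("you", "nsubj", 1), Token("want", "ROOT", 1), Token("milk", "dobj", 1)],
    [Token("dog", "nsubj", 1), Token("bark", "ROOT", 1)],
]

pair_counts = Counter(pair for utt in parsed for pair in subject_verb_pairs(utt))
print(pair_counts.most_common())
```

Note that this sketch does not check the part of speech of the head; whether and how non-verbal heads were filtered is not specified here.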
### 5\.2Zipfian Analysis
We analyzed the subjects of the 100 most frequent verbs. We restricted ourselves to frequent verbs so that there would be sufficient data to achieve quantitatively meaningful results; all verbs in this set appeared at least 2,396 times. For each verb, we created a list of the subjects that co-occur with that verb, ranked by the number of times that the subject-verb pairing appeared. Next, we converted these subject counts to proportions and calculated the average proportion of the subjects at each rank across all verbs. That is, for each rank $r$ we computed $f_{\text{empirical}}(r)$, the average frequency of a verb’s $r^{\text{th}}$ most common subject, as follows, where $\text{verb}_{k}$ is the $k$th most common verb, and $\text{subj}_{r,k}$ is the noun that occurs as the $r$th most common subject for $\text{verb}_{k}$:

$$f_{\text{empirical}}(r)=\frac{1}{100}\sum_{k=1}^{100}\frac{\text{count}(\text{subj}_{r,k},\text{verb}_{k})}{\text{count}(\text{verb}_{k})} \tag{2}$$

This formula gives the empirical frequencies of verb-subject pairings, which we then sought to fit to the theoretical predictions of Zipf’s Law:

$$f_{\text{theoretical}}(r,\alpha)=\frac{K}{r^{\alpha}} \tag{3}$$

Zipf’s Law has one free parameter $\alpha$ (note that $K$ is a normalizing constant, so it is fully determined by $\alpha$), so fitting $f_{\text{theoretical}}$ to $f_{\text{empirical}}$ amounted to finding the value of $\alpha$ that best fit the observed data. To do so, we tried all values of $\alpha$ ranging from 0 to 3.0 in increments of 0.01. For each $\alpha$ value, we computed the mean squared error (MSE) between $f_{\text{empirical}}$ and $f_{\text{theoretical}}$, defined as:

$$\text{MSE}(\alpha)=\frac{1}{R}\sum_{r=1}^{R}\left(f_{\text{empirical}}(r)-f_{\text{theoretical}}(r,\alpha)\right)^{2} \tag{4}$$

where $R$ is the number of ranks over which we computed the error. We then selected the $\alpha$ value that minimized $\text{MSE}(\alpha)$.
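The fitting procedure in Equations 2 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the analysis code used for the paper: the function names are hypothetical, the toy counts are invented, and $K$ is chosen here so that the theoretical frequencies sum to 1 over the $R$ ranks being compared (the text only states that $K$ is determined by $\alpha$).

```python
def mean_rank_proportions(subject_counts_by_verb, num_ranks):
    """Equation 2: average over verbs of the share of each verb's r-th most common subject."""
    averaged = [0.0] * num_ranks
    for counts in subject_counts_by_verb.values():
        total = sum(counts.values())
        ranked = sorted(counts.values(), reverse=True)[:num_ranks]
        # Verbs with fewer than num_ranks subjects contribute 0 at the missing ranks.
        for i, count in enumerate(ranked):
            averaged[i] += count / total
    return [value / len(subject_counts_by_verb) for value in averaged]


def zipf(alpha, num_ranks):
    """Equation 3, with K normalizing over num_ranks ranks (an assumption)."""
    weights = [r ** -alpha for r in range(1, num_ranks + 1)]
    k = 1.0 / sum(weights)
    return [k * w for w in weights]


def mse(empirical, alpha):
    """Equation 4: mean squared error between empirical and Zipfian frequencies."""
    theoretical = zipf(alpha, len(empirical))
    return sum((e - t) ** 2 for e, t in zip(empirical, theoretical)) / len(empirical)


def fit_alpha(empirical):
    """Grid search over alpha = 0.00, 0.01, ..., 3.00, returning the MSE minimizer."""
    grid = [i / 100 for i in range(301)]
    return min(grid, key=lambda alpha: mse(empirical, alpha))


# Invented toy counts: verb -> {subject: count}.
toy_counts = {
    "go": {"you": 6, "we": 3, "it": 1},
    "see": {"i": 5, "you": 5},
}
f_empirical = mean_rank_proportions(toy_counts, num_ranks=3)
print(f_empirical, fit_alpha(f_empirical))
```

A grid search is simple and transparent for a single bounded parameter; a continuous optimizer would also work but is not what the text describes.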
### 5\.3Results
Figure 3: The empirical distribution of subject-verb pairings in CHILDES (averaged across verbs in accordance with Equation [2](https://arxiv.org/html/2605.20529#S5.E2)), along with the frequencies predicted by a Zipfian distribution with parameter $\alpha=1.43$ ($\alpha$ was chosen by finding the best fit to the data).

Figure 4: The fitted Zipf parameter $\alpha$ decreases with child age. The dashed red line indicates the optimal $\alpha$ found in neural network simulations; the dotted purple line indicates the overall corpus $\alpha$.

We found that the best-fitting value of $\alpha$ was 1.43. See Figure [3](https://arxiv.org/html/2605.20529#S5.F3) for a comparison between $f_{\text{empirical}}$ and $f_{\text{theoretical}}$ with this $\alpha$ value. In addition to the dataset-wide fitting described above, we also broke down the analysis by the age of the child being spoken to in order to see whether the best-fitting $\alpha$ value varied by the age of the target child. As shown in Figure [4](https://arxiv.org/html/2605.20529#S5.F4), the Zipfian parameter $\alpha$ generally decreases as the target child’s age increases. Sample utterances by age group are in Appendix [C](https://arxiv.org/html/2605.20529#A3).
Strikingly, both the $\alpha$ value calculated for all utterances ($\alpha=1.43$) and the range of $\alpha$ values found across age groups ($\alpha=1.23$ to $1.46$) are close to the value of $\alpha$ where our model generalized best, $\alpha \approx 1.4$. This finding suggests that naturalistic English input has a level of subject-verb variability that facilitates the acquisition of agreement.
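As a quick arithmetic check of how close the age-stratified values are to the model optimum, the snippet below uses the eight per-age $\alpha$ values listed in Appendix C, ordered from the youngest to the oldest age group. The variable names are illustrative, and treating 1.4 as an exact reference point is a simplification, since the optimum is only reported approximately.

```python
MODEL_OPTIMUM = 1.4  # approximate optimum from the simulations

# Per-age-group alpha values from Appendix C, youngest to oldest.
ALPHA_BY_AGE = [1.46, 1.40, 1.44, 1.38, 1.37, 1.28, 1.23, 1.25]

deviations = [abs(alpha - MODEL_OPTIMUM) for alpha in ALPHA_BY_AGE]
largest_gap = max(deviations)

# Compare the four youngest groups with the four oldest groups.
younger_mean = sum(ALPHA_BY_AGE[:4]) / 4
older_mean = sum(ALPHA_BY_AGE[4:]) / 4

print(f"largest gap from {MODEL_OPTIMUM}: {largest_gap:.2f}")
print(f"mean alpha, younger half: {younger_mean:.2f}; older half: {older_mean:.2f}")
```

The largest gap comes from one of the older age groups, in line with the downward trend in Figure 4.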
## 6Discussion
We have used neural network language models to show that it is possible for statistical learners to robustly generalize English subject-verb agreement by using collocational bootstrapping. This bootstrapping strategy only succeeds under certain statistical conditions (when the $\alpha$ parameter in Zipf’s law is about 1.4); we have further found preliminary evidence that child-directed speech has the right properties for this strategy to be viable.
#### Making inferences about child language acquisition:
Due to the many differences between our synthetic text and natural child-directed language, we do not intend to draw strong conclusions about the similarity between the model-optimal $\alpha$ value ($\approx 1.4$) and the empirical $\alpha$ found in CHILDES (1.43). It is worth noting that the type of simulations we conducted, whether done with fully synthetic data or data closer to child-directed language, can only provide evidence about which learning strategies could be effective, not whether children actually use those strategies during acquisition.
#### Toward greater naturalness:
Our synthetic data sets were highly simplified, differing from naturalistic language in important ways. First, our existing data sets likely over-represent the presence of prepositional phrases before the verb. Second, our training sets were fully ambiguous between Agree-Subject and Agree-Recent, whereas naturalistic data contain some disambiguating examples, though naturalistic data can also contain agreement attraction errors Bock and Miller ([1991](https://arxiv.org/html/2605.20529#bib.bib2)) that point toward Agree-Recent rather than Agree-Subject. Third, naturalistic data involve a much larger vocabulary than what we used here. Fourth, naturalistic English sentences often use verbs (e.g., past-tense verbs) that are not explicitly inflected for number.
Beyond these differences in the word sequences encountered by our models vs\. human children, our models also differ from human learners in only having access to text, whereas children receive multi\-modal input that might support types of bootstrapping not available to our models\. Children have access to prosody, which might provide syntactic cues through prosodic bootstrapping, and can also draw on real\-world context, which might provide meaning that can serve as a cue to syntax, as suggested under semantic bootstrapping\. Future work could explore the effects of modifying the training set in ways that overcome these qualitative gaps\.
#### The difficulty of agreement acquisition:
Our analysis of CHILDES found that its statistical properties make it well-suited for collocational bootstrapping. However, prior work has found that both children Nozari and Omaki ([2022](https://arxiv.org/html/2605.20529#bib.bib27)) and neural networks trained on child-directed language Huebner et al. ([2021](https://arxiv.org/html/2605.20529#bib.bib9)); Padovani et al. ([2025](https://arxiv.org/html/2605.20529#bib.bib29)) make agreement attraction errors, meaning that they have not learned subject-verb agreement as robustly as might be expected from our analysis. A likely explanation for the discrepancy is that some of the other factors mentioned above could counteract the favorable $\alpha$ value that we have observed in ways that add further difficulty to the acquisition task. An ultimate goal is to investigate which types of input data and learning strategies can reproduce both the successes and failures of subject-verb agreement in humans.
#### Statistical co\-occurrence or semantic relatedness?
Semantic bootstrapping leverages the meaning of words as a cue to syntax. Since semantically related words often occur near each other, there may be overlap between semantic and collocational cues. Indeed, past computational work has found a relationship between a word’s meaning and its statistical distribution in a corpus Landauer and Dumais ([1997](https://arxiv.org/html/2605.20529#bib.bib13)); Mikolov et al. ([2013](https://arxiv.org/html/2605.20529#bib.bib19)). Future work could tease apart the respective roles of semantics and statistics as cues to syntactic dependencies.
Semantic bootstrapping is typically framed as a mechanism for inferring syntactic categories such as parts of speech\. Collocational bootstrapping is instead a strategy for acquiring word\-word dependencies, under which learners can bootstrap from one type of word\-word relatedness \(co\-occurrence\) to another \(syntactic dependencies\)\. Since semantics and distribution overlap, this same broad strategy could instead use semantic relatedness rather than distributional co\-occurrence as a cue to syntactic dependencies, providing a way to extend semantic bootstrapping to the learning of dependencies\.
#### Extending collocational bootstrapping:
Another direction for future work is extending collocational bootstrapping by analyzing whether it is an effective strategy for learning other syntactic dependencies beyond the one studied here \(subject\-verb dependencies\)\. A natural first step would be to investigate other agreement phenomena in English and other languages, such as noun\-anaphor number agreement and adjective\-noun gender agreement\.
## 7Conclusion
We have proposed collocational bootstrapping as a potential mechanism by which word co-occurrence statistics can support the learning of syntax. We have tested our hypothesis using neural language models, training and evaluating them on synthetic data with varying levels of variability in subject-verb pairings. We have found that there is an optimal level of variability, specifically a Zipfian distribution with $\alpha \approx 1.4$, that maximizes the model’s ability to generalize. Too little variability prevents the model from generalizing to novel noun-verb pairs, and too much variability prevents it from abstracting syntactic rules. The $\alpha$ value at which this sweet spot occurs is consistent with the level of variability observed in child-directed speech ($\alpha=1.43$), suggesting that the statistical structure of natural language could guide learners in correctly acquiring syntax. These results provide one illustration of how statistical properties of linguistic data can facilitate the learning of abstract syntactic phenomena.
## Limitations
Our neural network experiments involve simplified, synthetic training data that differ from children’s input in qualitative ways, and we have only analyzed the effect of one statistical cue on one linguistic phenomenon; see Section[6](https://arxiv.org/html/2605.20529#S6)for discussion\.
## Acknowledgments
We extend our thanks to the anonymous reviewers and the feedback they provided on this paper, and to Jason Hobbs for his technical insight and support\. We used Claude Code for assistance with coding, and we checked all AI\-generated code\. We also used Grammarly, and Claude Opus 4\.6 and Sonnet 4\.5, for feedback on style and grammar, but all ideas in the paper were ours\. Any errors are our own\.
## References
- Bootstrapping language acquisition\.Cognition164,pp\. 116–143\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Bock and C\. A\. Miller \(1991\)Broken agreement\.Cognitive Psychology23\(1\),pp\. 45–93\.Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px2.p1.1)\.
- N\. Chomsky \(1965\)Aspects of the theory of syntax\.MIT Press,Cambridge, MA\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p1.1)\.
- J\. L\. Elman \(1990\)Finding structure in time\.Cognitive Science14\(2\),pp\. 179–211\.Cited by:[§3](https://arxiv.org/html/2605.20529#S3.p3.1)\.
- J\. L\. Elman \(1991\)Distributed representations, simple recurrent networks, and grammatical structure\.Machine Learning7\(2\),pp\. 195–225\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.20529#S3.p3.1)\.
- S\. Finch and N\. Chater \(1992\)Bootstrapping syntactic categories\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.14\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Frank, D\. Mathis, and W\. Badecker \(2013\)The acquisition of anaphora by simple recurrent networks\.Language Acquisition20\(3\),pp\. 181–227\.Cited by:[§3](https://arxiv.org/html/2605.20529#S3.p3.1)\.
- Y\. Goldberg \(2019\)Assessing BERT’s syntactic abilities\.arXiv preprint arXiv:1901\.05287\.External Links:[Link](https://arxiv.org/abs/1901.05287)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1)\.
- R\. L\. Gómez \(2002\)Variability and detection of invariant structure\.Psychological Science13\(5\),pp\. 431–436\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px4.p1.1)\.
- K\. Gulordava, P\. Bojanowski, E\. Grave, T\. Linzen, and M\. Baroni \(2018\)Colorless green recurrent networks dream hierarchically\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 1195–1205\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-1108)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Honnibal, I\. Montani, S\. Van Landeghem, and A\. Boyd \(2020\)SpaCy: industrial\-strength natural language processing in Python\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.1212303)Cited by:[§5\.1](https://arxiv.org/html/2605.20529#S5.SS1.p3.1)\.
- P\. A\. Huebner, E\. Sulem, C\. Fisher, and D\. Roth \(2021\)BabyBERTa: learning more grammar with small\-scale child\-directed language\.InProceedings of the 25th Conference on Computational Natural Language Learning,pp\. 624–646\.Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px3.p1.1)\.
- J\. Jumelet and D\. Hupkes \(2018\)Do language models understand anything? On the ability of LSTMs to understand negative polarity items\.InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,T\. Linzen, G\. Chrupała, and A\. Alishahi \(Eds\.\),Brussels, Belgium,pp\. 222–231\.External Links:[Link](https://aclanthology.org/W18-5424/),[Document](https://dx.doi.org/10.18653/v1/W18-5424)Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p2.1)\.
- A\. Karpathy \(2023\)NanoGPT\.Note:Computer softwareGitHubExternal Links:[Link](https://github.com/karpathy/nanogpt)Cited by:[§4\.2](https://arxiv.org/html/2605.20529#S4.SS2.p1.1)\.
- V\. Kempe, P\. J\. Brooks, and S\. Gillis \(2024\)Four decades of open language science: the CHILDES project\.Language Teaching Research Quarterly44,pp\. 15–30\.External Links:[Document](https://dx.doi.org/10.32038/ltrq.2024.44.04)Cited by:[Appendix E](https://arxiv.org/html/2605.20529#A5.p1.1)\.
- D\. Kingma and J\. Ba \(2015\)Adam: a method for stochastic optimization\.InInternational Conference on Learning Representations,Cited by:[§4\.3](https://arxiv.org/html/2605.20529#S4.SS3.p1.2)\.
- A\. Kuncoro, C\. Dyer, J\. Hale, D\. Yogatama, S\. Clark, and P\. Blunsom \(2018\)LSTMs can learn syntax\-sensitive dependencies well, but modeling structure makes them better\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1426–1436\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1)\.
- T\. K\. Landauer and S\. T\. Dumais \(1997\)A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge\.Psychological Review104\(2\),pp\. 211\.Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px4.p1.1)\.
- O\. Lavi\-Rotbain and I\. Arnon \(2023\)Zipfian distributions in child\-directed speech\.Open Mind: Discoveries in Cognitive Science7,pp\. 1–30\.External Links:[Document](https://dx.doi.org/10.1162/opmi%5Fa%5F00070)Cited by:[§4\.1](https://arxiv.org/html/2605.20529#S4.SS1.p4.5)\.
- C\. S\. Leong and T\. Linzen \(2026\)Manipulating language models’ training data to study syntactic constraint learning: the case of English passivization\.Journal of Memory and Language149,pp\. 104751\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Linzen, E\. Dupoux, and Y\. Goldberg \(2016\)Assessing the ability of LSTMs to learn syntax\-sensitive dependencies\.Transactions of the Association for Computational Linguistics4,pp\. 521–535\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00115)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Bkg6RiCqY7)Cited by:[§4\.3](https://arxiv.org/html/2605.20529#S4.SS3.p1.2)\.
- B\. MacWhinney \(2000\)The CHILDES project: tools for analyzing talk\.Lawrence Erlbaum,Mahwah, NJ\.Cited by:[§5\.1](https://arxiv.org/html/2605.20529#S5.SS1.p1.1)\.
- M\. Maratsos and M\. A\. Chalkley \(1980\)The internal language of children’s syntax: the ontogenesis and representation of syntactic categories\.Children’s Language\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Marvin and T\. Linzen \(2018\)Targeted syntactic evaluation of language models\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 1192–1202\.Cited by:[§4\.4](https://arxiv.org/html/2605.20529#S4.SS4.p1.1)\.
- R\. T\. McCoy, R\. Frank, and T\. Linzen \(2020\)Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence\-to\-sequence networks\.Transactions of the Association for Computational Linguistics8,pp\. 125–140\.Cited by:[§3](https://arxiv.org/html/2605.20529#S3.p3.1)\.
- T\. Mikolov, I\. Sutskever, K\. Chen, G\. S\. Corrado, and J\. Dean \(2013\)Distributed representations of words and phrases and their compositionality\.Advances in Neural Information Processing Systems26\.Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px4.p1.1)\.
- T\. H\. Mintz \(2003\)Frequent frames as a cue for grammatical categories in child directed speech\.Cognition90\(1\),pp\. 91–117\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Misra and N\. Kim \(2024\)Generating novel experimental hypotheses from language models: a case study on cross\-dative generalization\.arXiv preprint arXiv:2408\.05086\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Misra and K\. Mahowald \(2024\)Language models learn rare phenomena from less rare phenomena: the case of the missing AANNs\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 913–929\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- J\. L\. Morgan and K\. Demuth \(1996\)Signal to syntax: bootstrapping from speech to grammar in early acquisition\.Psychology Press\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p1.1),[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Mueller, R\. Frank, T\. Linzen, L\. Wang, and S\. Schuster \(2022\)Coloring the blank slate: pre\-training imparts a hierarchical inductive bias to sequence\-to\-sequence models\.InFindings of the Association for Computational Linguistics: ACL 2022,pp\. 1352–1368\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p2.1)\.
- K\. Mulligan, R\. Frank, and T\. Linzen \(2021\)Structure here, bias there: hierarchical generalization by jointly learning syntactic transformations\.InProceedings of the Society for Computation in Linguistics 2021,pp\. 125–135\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Nozari and A\. Omaki \(2022\)Revisiting agreement: do children and adults compute subject\-verb agreement differently?\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.44\.Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px3.p1.1)\.
- L\. Onnis, P\. Monaghan, M\. H\. Christiansen, and N\. Chater \(2004\)Variability is the spice of learning, and a crucial ingredient for detecting and generalizing in nonadjacent dependencies\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.26\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px4.p1.1)\.
- F\. Padovani, J\. Jumelet, Y\. Matusevych, and A\. Bisazza \(2025\)Child\-directed language does not consistently boost syntax learning in language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 19746–19767\.External Links:[Link](http://dx.doi.org/10.18653/v1/2025.emnlp-main.999),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.999)Cited by:[§6](https://arxiv.org/html/2605.20529#S6.SS0.SSS0.Px3.p1.1)\.
- A\. Patil, J\. Jumelet, Y\. Y\. Chiu, A\. Lapastora, P\. Shen, L\. Wang, C\. Willrich, and S\. Steinert\-Threlkeld \(2024\)Filtered corpus training \(FiCT\) shows that language models can generalize from indirect evidence\.Transactions of the Association for Computational Linguistics12,pp\. 1597–1615\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- L\. S\. Pearl and B\. Mis \(2016\)The role of indirect positive evidence in syntactic acquisition: a look at anaphoric one\.Language92\(1\),pp\. 1–30\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Pinker \(1984\)Language learnability and language development\.Harvard University Press\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- G\. K\. Pullum and B\. C\. Scholz \(2002\)Empirical assessment of stimulus poverty arguments\.The Linguistic Review19\(1\-2\),pp\. 9–50\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Qin, N\. Saphra, and D\. Alvarez\-Melis \(2025\)Data drives unstable hierarchical generalization in LMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 11722–11740\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.593/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.593),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\)Language models are unsupervised multitask learners\.Note:OpenAIExternal Links:[Link](https://openai.com/blog/better-language-models/)Cited by:[§4\.2](https://arxiv.org/html/2605.20529#S4.SS2.p1.1)\.
- L\. Raviv, G\. Lupyan, and S\. C\. Green \(2022\)How variability shapes learning and generalization\.Trends in Cognitive Sciences26\(6\),pp\. 462–483\.External Links:[Document](https://dx.doi.org/10.1016/j.tics.2022.03.007)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30\.External Links:[Link](https://arxiv.org/abs/1706.03762)Cited by:[§4\.2](https://arxiv.org/html/2605.20529#S4.SS2.p1.1)\.
- J\. Wei, D\. Garrette, T\. Linzen, and E\. Pavlick \(2021\)Frequency effects on syntactic rule learning in transformers\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 932–948\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.72)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Wexler and P\. W\. Culicover \(1980\)Formal Principles of Language Acquisition\.MIT Press\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p1.1),[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- E\. G\. Wilcox, R\. Futrell, and R\. Levy \(2024\)Using computational models to test syntactic learnability\.Linguistic Inquiry55\(4\),pp\. 805–848\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p2.1)\.
- E\. Wonnacott, E\. L\. Newport, and M\. K\. Tanenhaus \(2008\)Acquiring and processing verb argument structure: distributional learning in a miniature language\.Cognitive Psychology56\(3\),pp\. 165–209\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Yang \(2016\)The price of linguistic productivity: how children learn to break the rules of language\.MIT Press\.Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Yang, A\. Bisazza, N\. Schneider, and E\. G\. Wilcox \(2026a\)A unified assessment of the poverty of the stimulus argument for neural language models\.External Links:2602\.09992,[Link](https://arxiv.org/abs/2602.09992)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Yang, H\. Getz, and E\. G\. Wilcox \(2026b\)From linear input to hierarchical structure: function words as statistical cues for language learning\.External Links:2601\.21191,[Link](https://arxiv.org/abs/2601.21191)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Yedetore and N\. Kim \(2024\)Semantic training signals promote hierarchical syntactic generalization in transformers\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 4059–4073\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.235/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.235)Cited by:[§2](https://arxiv.org/html/2605.20529#S2.SS0.SSS0.Px1.p1.1)\.
- G\. K\. Zipf \(1949\)Human behavior and the principle of least effort: an introduction to human ecology\.Addison\-Wesley Press,Cambridge, MA\.Cited by:[§1](https://arxiv.org/html/2605.20529#S1.p8.3)\.
## Appendix ASample training sentences
Below are examples of sentences used in training for a few levels of variability\.
Maximally Variable ($\alpha = 0$):
- the driver leads
- by the solver the challenger trades
- the dancers near the writers embezzle
- by the twirlers the painters near the singers navigate
Moderately Variable ($\alpha = 1.5$):
- the hunters listen
- by the builder the twirler collapses
- by the swimmers the bridgers bridge
- near the miner the jumper near the painter jumps
No Variability ($\alpha \to \infty$):
- the twirler twirls
- by the charmers the miners mine
- the builders near the protectors build
- near the swimmers the lassoers by the bakers lasso
## Appendix BSpeaker role utterance counts
See Figure[5](https://arxiv.org/html/2605.20529#A2.F5)for utterance counts by each type of speaker in the CHILDES data that we analyzed\.
Figure 5:Distribution of utterances by speaker role in the English subset of CHILDES \(ages 0\-96 months\)\.
## Appendix CExamples of child\-directed language at varying child ages
Below are examples of child\-directed language spoken to children of varying ages\.
Age 0-12 months ($\alpha=1.46$)
1. “you put the block on”
2. “what else do we see in here”
3. “oh be be very gentle with baby right”
Age 12-24 months ($\alpha=1.40$)
1. “what’s that”
2. “yeah that’s where we were”
3. “you don’t like MacDonald’s and I don’t like MacDonald’s”
Age 24-36 months ($\alpha=1.44$)
1. “how do you know this is a duck”
2. “this is velcro”
3. “let’s sit here on mama’s mama’s knee”
Age 36-48 months ($\alpha=1.38$)
1. “you get milk from it”
2. “look at these”
3. “want mommy to read”
Age 48-60 months ($\alpha=1.37$)
1. “i think we found the wheels or your mom did”
2. “just like we see up there remember”
3. “there’s something wrong with her teeth aren’t there”
Age 60-72 months ($\alpha=1.28$)
1. “well I know but you know what I think this chair is”
2. “so you want listen come here I’m going to tell you”
3. “I don’t think I would like those”
Age 72-84 months ($\alpha=1.23$)
1. “I never heard of that one before”
2. “dad’s gonna dads can do it a lot”
3. “there’s how many bears on one wheel”
Age 84-96 months ($\alpha=1.25$)
1. “I got you a pencil”
2. “alright well if you don’t put it on then the letter’s no good”
3. “uh what about a movie though”
## Appendix DTraining loss
See Figure[6](https://arxiv.org/html/2605.20529#A4.F6)for the loss trajectories of the models we trained\.
Figure 6: Training and validation loss at three $\alpha$ values. Loss curves track closely across all conditions, indicating no overfitting.
## Appendix EData Cleaning
We downloaded 5,147,586 utterances from participants categorized as English-language speakers, restricted to 25 target speaker roles: Adult, Caretaker, Father, Friend, Grandfather, Grandmother, Investigator, Mother, Narrator, Playmate, Relative, Sibling, Sister, Brother, Teacher, Unidentified, Visitor, Teenager, Participant, Girl, Male, Student, Environment, Doctor, Target Adult. Next, we removed rows with null text content, converted utterances to text strings, removed rows lacking a target child age, and removed rows where the target child age was greater than 96 months. After cleaning, 4,739,189 utterances remained. Of these, 59.0% were spoken by mothers and 31.6% by investigators, together comprising about 90% of all utterances, as shown in Figure [5](https://arxiv.org/html/2605.20529#A2.F5) in Appendix [B](https://arxiv.org/html/2605.20529#A2). This proportion reflects the high concentration of speech from caregivers in the corpus Kempe et al. ([2024](https://arxiv.org/html/2605.20529#bib.bib54)). During the subject-verb extraction step, subject and verb lemmas were converted to lower case, and the resulting verb counts were restricted to ASCII English forms before selecting the top 100 verbs for analysis.
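The cleaning steps above can be sketched as a single filtering pass. This is an illustrative reconstruction, not the script used for the paper: the record layout (a dict per utterance) and the field names `speaker_role`, `text`, and `target_child_age_months` are assumptions, and the example rows are invented.

```python
TARGET_ROLES = {
    "Adult", "Caretaker", "Father", "Friend", "Grandfather", "Grandmother",
    "Investigator", "Mother", "Narrator", "Playmate", "Relative", "Sibling",
    "Sister", "Brother", "Teacher", "Unidentified", "Visitor", "Teenager",
    "Participant", "Girl", "Male", "Student", "Environment", "Doctor",
    "Target Adult",
}
MAX_AGE_MONTHS = 96


def clean_utterances(rows):
    """Filter utterance records; field names are hypothetical."""
    cleaned = []
    for row in rows:
        if row.get("speaker_role") not in TARGET_ROLES:
            continue  # keep only the 25 target speaker roles
        if row.get("text") is None:
            continue  # drop rows with null text content
        age = row.get("target_child_age_months")
        if age is None or age > MAX_AGE_MONTHS:
            continue  # drop rows with a missing age or an age above 96 months
        cleaned.append({**row, "text": str(row["text"])})  # coerce text to a string
    return cleaned


# Invented example rows.
rows = [
    {"speaker_role": "Mother", "text": "look at these", "target_child_age_months": 40},
    {"speaker_role": "Target_Child", "text": "doggie", "target_child_age_months": 40},
    {"speaker_role": "Father", "text": None, "target_child_age_months": 30},
    {"speaker_role": "Investigator", "text": "what's that", "target_child_age_months": None},
    {"speaker_role": "Teacher", "text": "I got you a pencil", "target_child_age_months": 120},
]
cleaned = clean_utterances(rows)
print(len(cleaned), "of", len(rows), "rows kept")
```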
## Appendix FAnalysis of subject\-verb pairings by age
See Table[3](https://arxiv.org/html/2605.20529#A6.T3)for statistics of subject\-verb pairings in child\-directed language broken down by the age of the child being spoken to\.
Table 3: Age-stratified analysis of subject-verb pairings in CHILDES (0–96 months). The Zipf parameter $\alpha$ decreases from 1.46 in the youngest age group to 1.25 in the oldest.