The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models
Summary
This paper investigates whether text-based LLMs and an ASR model store verb+up phrasal verbs holistically, finding that frequency and predictability drive holistic storage, supporting usage-based theories of language.
# The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models
Source: [https://arxiv.org/html/2606.13993](https://arxiv.org/html/2606.13993)
Zachary N\. Houghton \(University of Oregon; Vail Systems, Inc; znh@uoregon\.edu\), Yu Zhou \(Vail Systems, Inc\), Dan Pluth \(Vail Systems, Inc\), Vijay K\. Gurbani \(Vail Systems, Inc\)
###### Abstract
A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules\. While recent work has examined abstract knowledge in language models, holistic storage of multi\-word units has received far less attention\. We probe internal representations in text\-based LLMs and an ASR model, testing whether V\+*up* phrasal verbs develop distinct representations as a function of frequency and predictability\. All models show evidence of holistic storage driven by frequency and predictability, further supporting usage\-based theories of language\.
## 1 Introduction
A central debate in linguistics concerns how humans trade off between computation and storage \(Stemberger and MacWhinney, [2004](https://arxiv.org/html/2606.13993#bib.bib10), [1986](https://arxiv.org/html/2606.13993#bib.bib11); Kapatsinski et al\., [2009](https://arxiv.org/html/2606.13993#bib.bib38); Houghton and Morgan, [2023](https://arxiv.org/html/2606.13993#bib.bib12), [2024](https://arxiv.org/html/2606.13993#bib.bib13); Houghton, [2025b](https://arxiv.org/html/2606.13993#bib.bib14), [a](https://arxiv.org/html/2606.13993#bib.bib2); Morgan and Levy, [2016](https://arxiv.org/html/2606.13993#bib.bib15), [2024](https://arxiv.org/html/2606.13993#bib.bib16), [2015](https://arxiv.org/html/2606.13993#bib.bib19)\)\. Computation refers to applying abstract knowledge to generate new representations; for example, deriving *wugs* from *wug* via a productive pluralization rule\. Storage refers to retrieving a whole representation from memory rather than computing it, such as accessing a holistic representation of a common phrase \(e\.g\., *I don’t know*\) despite being able, in principle, to generate the phrase compositionally\. Both mechanisms are clearly at work in language learning and processing, but the factors that determine which forms are produced via computation and which are stored and retrieved holistically remain poorly understood\.
### 1\.1 Computation vs Storage in Humans
There is an abundance of evidence for abstract knowledge in language use by humans\. Children generalize morphological rules productively to novel words they have never encountered \(Berko, [1958](https://arxiv.org/html/2606.13993#bib.bib20)\), and priming studies show that sentences sharing syntactic structure or semantically related words facilitate each other’s processing, implying that something abstract is shared across their representations \(Bock, [1986](https://arxiv.org/html/2606.13993#bib.bib21); Meyer and Schvaneveldt, [1971](https://arxiv.org/html/2606.13993#bib.bib23)\)\. Similarly, when ordering novel or low\-frequency binomials, humans rely on abstract preferences \(e\.g\., preferring the shorter word first\) rather than simply producing the more frequent ordering \(Morgan and Levy, [2016](https://arxiv.org/html/2606.13993#bib.bib15)\)\.
Storage has had a more controversial history\. Early accounts held that only irregular forms \(e\.g\., *went*\) were stored holistically and that any compositionally derivable form was derived via computation \(Pinker and Ullman, [2002](https://arxiv.org/html/2606.13993#bib.bib24)\)\. Evidence against this view has since accumulated \(Kapatsinski et al\., [2009](https://arxiv.org/html/2606.13993#bib.bib38); Bybee and Scheibman, [1999](https://arxiv.org/html/2606.13993#bib.bib25); Morgan and Levy, [2016](https://arxiv.org/html/2606.13993#bib.bib15); Houghton, [2025a](https://arxiv.org/html/2606.13993#bib.bib2); Stemberger and MacWhinney, [2004](https://arxiv.org/html/2606.13993#bib.bib10)\)\. Stemberger and MacWhinney \([2004](https://arxiv.org/html/2606.13993#bib.bib10)\) showed that inflection errors are less common for high\-frequency words, suggesting holistic storage even for regularly derived forms\. Bybee and Scheibman \([1999](https://arxiv.org/html/2606.13993#bib.bib25)\) demonstrated that *don’t* is phonetically more reduced in high\-frequency phrases like *I don’t know* than in lower\-frequency phrases like *I don’t go*; if *don’t* always had the same representation, such context\-specific reduction would be difficult to explain\.
Processing studies have provided converging evidence\. Morgan and Levy \([2016](https://arxiv.org/html/2606.13993#bib.bib15)\) found that while human ordering preferences for low\-frequency binomials are driven by abstract preferences, preferences for high\-frequency binomials are driven by item\-specific preferences, suggesting a frequency\-dependent shift from computation to storage\. Most directly relevant to the present study, Kapatsinski et al\. \([2009](https://arxiv.org/html/2606.13993#bib.bib38)\) found that listeners are slower to detect *up* in high\-frequency Verb\+*up* \(V\+*up*\) phrases \(e\.g\., *pick up*\) than in lower\-frequency ones\. Specifically, they tasked participants with pressing a button when they heard *up*, which occurred either within a word or within a V\+*up* phrase\. They found that while participants were faster to press the button when responding to medium\-frequency phrases relative to low\-frequency phrases, they were slower to press the button when the V\+*up* phrases were high frequency\. Further, Houghton \([2025a](https://arxiv.org/html/2606.13993#bib.bib2)\) found that this pattern holds for high\-predictability phrases \(V\+*up* phrases where *up* is likely to appear given the verb\), suggesting that high\-frequency and high\-predictability V\+*up* phrases are represented holistically\.
### 1\.2 Computation vs Storage in LMs
Whether language models exhibit analogous computation\-storage tradeoffs to humans has become an active area of inquiry\. On the computation side, results are mixed: some studies find that models learn abstract generalizations not present in training \(Misra and Mahowald, [2024](https://arxiv.org/html/2606.13993#bib.bib28); Yao et al\., [2025](https://arxiv.org/html/2606.13993#bib.bib29); Lasri et al\., [2022](https://arxiv.org/html/2606.13993#bib.bib27)\), while others find that models fail to use abstract knowledge where humans do, such as morphological generalization to novel words \(Haley, [2020](https://arxiv.org/html/2606.13993#bib.bib30)\)\. On the storage side, there is no doubt that models rely heavily on memorization \(McCoy et al\., [2023](https://arxiv.org/html/2606.13993#bib.bib31)\)\. Indeed, item\-specific frequency effects have been documented even in tasks where humans show abstract preferences \(Houghton et al\., [2025](https://arxiv.org/html/2606.13993#bib.bib26)\)\.
Whether language models develop holistic phrasal representations in a similar manner as humans is an open question\. If they do, this would suggest that holistic storage is a natural consequence of learning from distributional patterns in the language, requiring no special storage mechanism, and lending support to usage\-based accounts of how computation and storage interact\. Audio\-based models are especially well\-suited to address this question because a good deal of the evidence for holistic storage comes from listening paradigms\. It is thus important to understand whether LLMs and automatic speech recognition \(ASR\) models both show similar representations\. And yet, holistic phrasal storage has never been examined in any such ASR model\. More broadly, holistic phrasal storage in language models has rarely been examined in any modality, and it has not been studied in models trained on human\-comparable amounts of data\.
### 1\.3 Present Study
The present study addresses this gap\. We probe internal representations in text\-based language models that were trained on an amount of data comparable to humans \(BabyLMs\), a large language model \(OLMo\-3 7B\), and an audio\-based speech recognition model \(Whisper\-small\)\. \(Footnote 1: Each BabyLM was trained on 150 million tokens\. The average college\-aged human has experienced roughly 350 million words \(Levy et al\., [2012](https://arxiv.org/html/2606.13993#bib.bib35)\), so each model was trained on a little less than half the tokens that an average college\-aged human has experienced\.\) These models were chosen to help illuminate effects of training size, number of parameters, and modality \(speech vs text\)\. In order to probe their representations, we trained logistic classifiers to detect the embedding of standalone *up*, then tested the classifiers on V\+*up* phrases of varying frequency and predictability\. If these models develop holistic phrasal representations, the representation of *up* in high\-frequency and high\-predictability V\+*up* phrases should diverge from that of standalone *up* more than in lower\-frequency and lower\-predictability phrases, resulting in lower logit scores from the classifier\. Our specific contributions are:
- We train and release three open\-access autoregressive models trained on the BabyLM v3 corpus \(Charpentier et al\., [2025](https://arxiv.org/html/2606.13993#bib.bib5)\), checkpointed every 20M tokens, to facilitate future research on human\-scale language learning\.
- We show that holistic phrasal storage emerges in both text\-based LLMs and an ASR model, establishing that frequency\- and predictability\-driven holistic representations arise even from models trained on an amount of data comparable to humans, and even across modalities\.
- We show that frequency effects on phrasal storage are robust across model sizes, but predictability effects strengthen with scale, suggesting that sensitivity to co\-occurrence statistics beyond raw frequency requires greater representational capacity\.
## 2 Model Training
In order to examine holistic storage in models that have seen an amount of data comparable to humans, we trained three autoregressive language models on the BabyLM v3 corpus \(Charpentier et al\., [2025](https://arxiv.org/html/2606.13993#bib.bib5)\), a 150M\-token dataset designed to be more reflective of the scale and quality of the language that humans receive\. \(Footnote 2: Though it is worth noting, as has been pointed out before \(e\.g\., Houghton, [2025a](https://arxiv.org/html/2606.13993#bib.bib2)\), that it may be misleading to compare the tokens that LLMs receive to the “tokens” that humans receive, since humans encounter language in a context\-rich environment while LLMs see only the raw text\.\) All three models follow the OPT decoder\-only transformer architecture \(Zhang et al\., [2022](https://arxiv.org/html/2606.13993#bib.bib9)\), with 125M, 350M, and 1\.3B parameters, respectively\.
Prior to training, we fit a byte\-pair encoding \(BPE\) tokenizer directly on the BabyLM training corpus, yielding a vocabulary of 8,192 subword types\. This tokenizer was shared across all three model sizes, ensuring that cross\-model comparisons are not confounded by differences in tokenization\. A full description of the model training is included in the Appendix \(Section[A](https://arxiv.org/html/2606.13993#A1)\)\.
## 3 Experiment 1: UP Independently
Experiment 1 tests whether LLMs develop holistic phrasal representations analogous to those proposed by usage\-based theories \(e\.g\., Kapatsinski et al\., [2009](https://arxiv.org/html/2606.13993#bib.bib38); Houghton, [2025a](https://arxiv.org/html/2606.13993#bib.bib2)\), using classifiers trained on standalone *up* representations and applied to V\+*up* phrases varying in frequency and predictability\. We investigate this across OLMo\-3 7B and the three BabyLMs \(OPT\-125M, 350M, and 1\.3B\)\.
### 3\.1 Methods
For each model \(OLMo\-3 7B and BabyLM OPT\-125M, OPT\-350M, and OPT\-1\.3B\), we extracted the representation of the token *up* from each hidden layer of the models for each sentence\. Using these representations, we trained a separate logistic regression classifier for each layer to distinguish standalone *up* \(which did not occur in V\+*up* phrases in the training of the classifier\) from other tokens in the same sentence \(these other tokens did not contain the segment *up* in any capacity\); the training items are described in the next section\. The trained classifier was then tested on a held\-out test set comprising V\+*up* phrases, and for each sentence the classifier returned a logit score reflecting the predicted probability that the representation of *up* in the V\+*up* phrase resembled the standalone class\.
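The probing logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes toy two-dimensional vectors in place of real hidden states, a hand-written batch gradient descent in place of whatever logistic regression solver was actually used, and hypothetical names (`train_probe`, `probe_logit`). It shows only the core idea: a probe fit on standalone *up* versus other tokens assigns a lower logit to a test vector that has drifted away from the standalone cluster.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def probe_logit(weights, bias, vector):
    """Classifier logit (log odds of the standalone-up class) for one vector."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def train_probe(vectors, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(probe_logit(weights, bias, x)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy "hidden states": standalone up near (1, 1), other tokens near (-1, -1).
positives = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.2, 1.1]]
negatives = [[-1.0, -0.9], [-1.1, -1.2], [-0.8, -1.0], [-1.2, -1.1]]
weights, bias = train_probe(positives + negatives, [1] * 4 + [0] * 4)

# Hypothetical test vectors for up inside V+up phrases.
compositional_up = [0.9, 1.0]  # still close to standalone up
holistic_up = [0.1, 0.0]       # drifted away from the standalone cluster

print(probe_logit(weights, bias, compositional_up))
print(probe_logit(weights, bias, holistic_up))
```

In the study itself, one such probe is fit per hidden layer on real model representations, so the comparison is repeated layer by layer.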
Frequency for each V\+*up* type is operationalized as its raw corpus count, log\-transformed\. Letting $c_{vup} = \text{count}(\text{V+up})$:

$$\text{log-frequency} = \log(c_{vup}) \tag{1}$$

Predictability is operationalized as the log\-odds ratio of V\+*up* occurrences to occurrences of V not followed by *up* in the corpus\. Letting $c_{V} = \text{count}(V)$:

$$\text{log-predictability} = \log\left(\frac{c_{vup}}{c_{V} - c_{vup}}\right) \tag{2}$$
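As a worked illustration of Equations 1 and 2, the sketch below computes both predictors from raw counts. The function names and the example counts are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def log_frequency(c_vup):
    """Equation 1: log of the raw corpus count of the V+up type."""
    if c_vup <= 0:
        raise ValueError("c_vup must be positive")
    return math.log(c_vup)


def log_predictability(c_vup, c_v):
    """Equation 2: log of count(V+up) over count(V not followed by up)."""
    if not 0 < c_vup < c_v:
        # The log odds are undefined when V never or always precedes up.
        raise ValueError("need 0 < c_vup < c_v for a valid estimate")
    return math.log(c_vup / (c_v - c_vup))


# Hypothetical counts as (count of V+up, count of V), for illustration only.
example_counts = {"pick up": (500, 1500), "walk up": (20, 4020)}
for phrase, (c_vup, c_v) in example_counts.items():
    print(phrase, log_frequency(c_vup), log_predictability(c_vup, c_v))
```

A verb that is followed by *up* in a third of its occurrences has odds of 0.5 and therefore a negative log-predictability, which is why the medians reported for the test items are negative.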
Counts were derived from the training corpus of each model: the BabyLM V3 dataset \(Charpentier et al\., [2025](https://arxiv.org/html/2606.13993#bib.bib5)\) for the BabyLM models, and Dolma v1\.7 \(queried via the infini\-gram API; Liu et al\., [2024](https://arxiv.org/html/2606.13993#bib.bib7)\) for OLMo\-3 7B\. Although OLMo\-3 7B was trained on Dolma 3, Dolma v1\.7 is the most recent snapshot indexed by infini\-gram and draws from the same underlying sources, making it a reasonable approximation of OLMo\-3 7B’s training distribution\. \(Footnote 3: The relative frequencies and co\-occurrence patterns of common English phrasal verbs are unlikely to differ substantially between Dolma versions, as both are large\-scale web\-text corpora of similar composition\.\)
#### 3\.1\.1 Classifier Training
The classifier was trained to distinguish the language models’ representations of the preposition *up* from representations of other tokens in the same sentence\. Positive training examples consisted of 1,000 occurrences of *up* which occurred in sentences strictly as a preposition, drawn from sentences in the C4 corpus \(Raffel et al\., [2020](https://arxiv.org/html/2606.13993#bib.bib6)\)\. \(Footnote 4: We used a morphological parser to filter out sentences in which *up* was not tagged as a preposition\.\) Negative examples were 1,000 tokens randomly selected from the same sentences, restricted to tokens whose decoded string consists entirely of alphabetic characters \(no numbers, punctuation, or special characters\), and excluding the preposition *up* itself and any token containing *up* as a substring\. The validation set was drawn from the same pool of sentences \(a non\-overlapping subset\), with token positions resolved per model’s tokenizer; exact counts fall slightly below 1,000 for some models because the tokenizer does not always produce an isolated *up* token\. For example, when *up* appears before punctuation and the tokenizer fuses the two into a single unit \(e\.g\., *up,*\), there is no isolable *up* position to extract, so that instance is excluded\. The classifiers for all models were trained on the same underlying sentences\.
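The negative-example criteria above can be sketched as a small filter. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, surrounding whitespace is stripped before checking (some tokenizers decode a leading space), and the substring check is case-insensitive.

```python
import random


def is_valid_negative(token):
    """True if the decoded token is purely alphabetic and does not contain 'up'."""
    text = token.strip()  # assumption: ignore tokenizer whitespace markers
    return text.isalpha() and "up" not in text.lower()


def sample_negatives(tokens, k, seed=0):
    """Randomly pick up to k valid negative tokens from one pool of tokens."""
    candidates = [t for t in tokens if is_valid_negative(t)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical decoded tokens from one sentence.
tokens = [" the", "up", "cup", "dog", "42", "ran", "up,"]
print(sample_negatives(tokens, 2))
```

Note that `str.isalpha()` also accepts non-ASCII letters; whether the original filter was ASCII-only is not stated.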
The test set comprised V\+*up* phrases \(e\.g\., *pick up*\) with at least 20 occurrences in the corpus; up to 20 sentences were sampled per type\. Because the BabyLM corpus is smaller than Dolma, fewer V\+*up* types attain a valid \(non\-zero\) predictability estimate; only types with a valid estimate are included in the analyses\. Full item\-level statistics are reported in the Appendix \(Table [3](https://arxiv.org/html/2606.13993#A3.T3)\)\. For OLMo\-3 7B, 4,081 unique V\+*up* types were included, with a median corpus frequency of 43,608 and a median log\-predictability of −5\.39\. The BabyLM models shared 1,039 items, with a lower median frequency \(34\) and log\-predictability \(−3\.40\), reflecting the smaller vocabulary of the BabyLM training data\. Classifier training and validation split sizes are shown in the Appendix \(Table [4](https://arxiv.org/html/2606.13993#A4.T4)\)\.
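The two test-set thresholds (at least 20 corpus occurrences per type, at most 20 sampled sentences per type) can be sketched as follows. The function name, the input dictionaries, and the fixed random seed are all hypothetical; the sampling procedure actually used is not described beyond these thresholds.

```python
import random


def build_test_set(sentences_by_type, corpus_counts,
                   min_count=20, max_per_type=20, seed=0):
    """Keep V+up types with enough corpus occurrences; sample sentences per type."""
    rng = random.Random(seed)
    test_set = {}
    for vup_type, sentences in sentences_by_type.items():
        if corpus_counts.get(vup_type, 0) < min_count:
            continue  # too rare in the corpus to include
        k = min(max_per_type, len(sentences))
        test_set[vup_type] = rng.sample(sentences, k)
    return test_set


# Hypothetical inputs for illustration.
sentences_by_type = {
    "pick up": [f"sentence {i}" for i in range(30)],
    "zip up": ["sentence a", "sentence b"],
    "mop up": ["sentence x", "sentence y", "sentence z"],
}
corpus_counts = {"pick up": 100, "zip up": 25, "mop up": 5}
test_set = build_test_set(sentences_by_type, corpus_counts)
print({k: len(v) for k, v in test_set.items()})
```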
#### 3\.1\.2 Analyses
In order to examine the effects of frequency and predictability on the representations of *up* in V\+*up* phrases, we implemented a Bayesian mixed\-effects regression model using the *brms* package \(Bürkner, [2017](https://arxiv.org/html/2606.13993#bib.bib3)\)\. For each statistical analysis, the outcome variable was the logit score returned by the classifier on each test item\. We included fixed effects for frequency and predictability, both of which were centered and scaled \(denoted $\text{c\_log\_freq}$ and $\text{c\_log\_predic}$ below\)\. We also included random intercepts for phrasal verb type \(since there were multiple sentences for each phrasal verb\)\. Weak, uninformative priors were placed on each fixed effect\. The model syntax is included below:

$$\text{logit} \sim \text{c\_log\_freq} \times \text{c\_log\_predic} + (1 \mid \text{verb\_up}) \tag{3}$$
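The centering and scaling applied to both predictors before they enter Equation 3 can be sketched as a z\-score transform. The exact procedure is an assumption: this version subtracts the mean and divides by the sample standard deviation, and the function name and example values are hypothetical.

```python
import statistics


def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD with n - 1 denominator
    return [(v - mean) / sd for v in values]


# Hypothetical log-frequency values for five V+up types.
log_freqs = [3.0, 4.5, 6.0, 7.5, 9.0]
c_log_freq = center_and_scale(log_freqs)
print(c_log_freq)
```

After this transform each coefficient in Equation 3 is expressed per standard deviation of its predictor, which makes the frequency and predictability estimates easier to compare.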
Separate statistical models were fit for each language model at the final hidden layer\. For each coefficient we report the posterior mean, standard error, 95% credible interval \(CI\), and the proportion of posterior samples greater than zero \(% \> 0\)\. We consider an effect to be meaningful if the 95% CI does not contain zero\. \(Footnote 5: Though as Houghton et al\. \([2024](https://arxiv.org/html/2606.13993#bib.bib36)\) point out, unlike frequentist statistics, Bayesian analyses don’t force us to commit to a binary of significance vs\. non\-significance, and the percentage of samples greater than zero can be interpreted in a continuous manner\.\)
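The reported summary statistics can be computed directly from posterior draws of a coefficient. The sketch below is illustrative only: it assumes an equal\-tailed 95% interval with linear interpolation between order statistics, treats the posterior standard deviation as the reported standard error, and uses hypothetical function names and draws.

```python
import statistics


def percentile(sorted_xs, q):
    """Linearly interpolated quantile q (between 0 and 1) of a pre-sorted list."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac


def summarize_posterior(samples):
    """Posterior mean, SD, equal-tailed 95% CI, and proportion of draws above zero."""
    xs = sorted(samples)
    lower, upper = percentile(xs, 0.025), percentile(xs, 0.975)
    return {
        "mean": statistics.fmean(xs),
        "sd": statistics.stdev(xs),
        "ci_lower": lower,
        "ci_upper": upper,
        "prop_gt_zero": sum(x > 0 for x in xs) / len(xs),
        "ci_excludes_zero": lower > 0 or upper < 0,
    }


# Hypothetical posterior draws for a frequency coefficient.
draws = [-0.5 + 0.001 * i for i in range(401)]
print(summarize_posterior(draws))
```

A coefficient whose draws are all negative has a proportion above zero of 0 and a CI that excludes zero, which corresponds to the "meaningful" negative effects described in the text.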
In order to examine the difference in representations across hidden layers, we additionally fit a generalized additive model \(GAM, Wood, [2017](https://arxiv.org/html/2606.13993#bib.bib8)\)\. As with the Bayesian analyses, the dependent variable was the classifier logit\. Specifically, two models were fit separately, one for frequency and one for predictability, each with a tensor product smooth over the predictor and hidden layer index \(with a random intercept for verb\):

$$\text{logit} \sim \mathrm{te}(\text{log\_freq}, \text{layer}) + s(\text{verb\_up}, \text{bs} = \texttt{'re'}) \tag{4}$$

$$\text{logit} \sim \mathrm{te}(\text{log\_predic}, \text{layer}) + s(\text{verb\_up}, \text{bs} = \texttt{'re'})$$
### 3\.2 Results
Full numerical results are reported in the Appendix \(Table [5](https://arxiv.org/html/2606.13993#A4.T5) for the Bayesian model and Table [6](https://arxiv.org/html/2606.13993#A4.T6) for the GAM\); they are visualized in Figure [1](https://arxiv.org/html/2606.13993#S3.F1) and Figure [2](https://arxiv.org/html/2606.13993#S3.F2)\. Figure [8](https://arxiv.org/html/2606.13993#A4.F8) \(Appendix\) shows the layer\-by\-layer difference between high\- and low\-predictor items more directly\.
For all four LLMs, there was a negative effect of both frequency and predictability on the classifier’s logit score\. In other words, higher frequency and higher predictability both make the representation of *up* in V\+*up* phrases less similar to the standalone representation of *up*\.
Additionally, the by\-layer analysis demonstrates that this divergence begins to appear early on in the larger models \(specifically, OLMo\-3 7B and BabyLM 1\.3B show this early effect\), while the smaller models do not show an effect until later layers\. This suggests that for larger models, the representations of high\-frequency V\+*up* phrases are more distinct from the representations of the individual constituent words \(in the case of the present experiment, the representation of standalone *up*\) than for smaller models\.
Figure 1: Final\-layer brms predicted logit by frequency \(left\) and predictability \(right\) for all models \(UP independently\)\. Shading indicates 95% CIs\.

Figure 2: GAM\-predicted logit by layer and predictor for all models \(UP independently\)\. Top: frequency; bottom: predictability\.
### 3\.3 Discussion
Experiment 1 found that the logits from a classifier trained on standalone *up* representations decreased for high\-frequency and high\-predictability V\+*up* phrases\. Additionally, the effect of predictability appears to be weaker in the smaller models\. These results replicate those of Kapatsinski et al\. \([2009](https://arxiv.org/html/2606.13993#bib.bib38)\) and Houghton \([2025a](https://arxiv.org/html/2606.13993#bib.bib2)\), who examined this in humans\. In Experiment 2, we more closely replicate these studies by also examining representations of *up* as a subword unit\.
## 4 Experiment 2: UP as Subword
One possible criticism of the results of Experiment 1 is that the lower logit score for high\-frequency/high\-predictability V\+*up* phrases could reflect semantic bleaching rather than holistic storage\. That is, high\-frequency items often undergo semantic bleaching or extension \(e\.g\., Harmon and Kapatsinski, [2017](https://arxiv.org/html/2606.13993#bib.bib37)\), resulting in a widening of the phrase’s meaning\. For example, in English the word *literally* has been widened beyond its original meaning \(e\.g\., "I’m literally dying", where *literally* behaves as an emphatic\)\. To address this concern, we replicate the design of Experiment 1 with one key change: rather than training the classifier on the preposition *up* exclusively, we additionally train it on instances of *up* embedded within a larger word \(e\.g\., *update*\), requiring the model to identify the sequence *up* regardless of whether it functions as a preposition\. This tests whether the frequency and predictability effects observed in Experiment 1 generalize to the *up*\-segment more broadly\. This design is also arguably more faithful to the studies of Kapatsinski et al\. \([2009](https://arxiv.org/html/2606.13993#bib.bib38)\) and Houghton \([2025a](https://arxiv.org/html/2606.13993#bib.bib2)\), where participants were tasked with recognizing *up* in general, regardless of whether it occurred within a word or as a preposition\.
### 4\.1 Methods
The procedure was identical to Experiment 1 \(same models; classifiers trained and evaluated at each hidden layer; same test set, outcome variable, and statistical approach\) with one difference: the classifier was trained on a broader set of positive examples that included instances of *up* embedded within a larger word \(e\.g\., *update*, *upon*\)\.
#### 4\.1\.1 Classifier Training
The training set for this experiment combined two types of positive example: 1,000 occurrences of standalone *up* \(identical to Experiment 1\) and 1,000 occurrences of *up* embedded within a larger word \(e\.g\., *update*, *upon*\)\. Critically, the subword positives were restricted to unique word types: each *up*\-containing word contributed exactly one instance, so the classifier could not learn to recognise a high\-frequency form like *upon* from repeated exposure\. Negative examples \(1,000 drawn from each sentence pool\) were tokens consisting entirely of alphabetic characters that did not contain *up* as a substring\. The validation set was constructed by the same procedure \(targeting 1,000 positive, 1,000 negative per sentence pool; exact counts vary by tokenizer\)\. As in Experiment 1, each model used the same underlying sentences with their own tokenizers\. The test set was identical to Experiment 1\. Classifier training/validation split sizes are shown in the Appendix \(Table [7](https://arxiv.org/html/2606.13993#A5.T7)\)\.
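The unique\-word\-type restriction on subword positives can be sketched as a simple deduplication pass. The function name and input format are hypothetical, and treating word types case\-insensitively and keeping the first occurrence of each type are assumptions; the text specifies only that each *up*\-containing word contributes one instance.

```python
def select_subword_positives(occurrences, limit=1000):
    """Keep one (word, sentence) pair per distinct up-containing word type."""
    seen = set()
    selected = []
    for word, sentence in occurrences:
        key = word.lower()  # assumption: word types are case-insensitive
        # Skip words without "up", standalone "up", and repeated types.
        if "up" not in key or key == "up" or key in seen:
            continue
        seen.add(key)
        selected.append((word, sentence))
        if len(selected) == limit:
            break
    return selected


# Hypothetical occurrences for illustration.
occurrences = [
    ("upon", "s1"), ("update", "s2"), ("Upon", "s3"),
    ("up", "s4"), ("dog", "s5"), ("supper", "s6"),
]
print(select_subword_positives(occurrences))
```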
#### 4\.1\.2 Analyses
Analyses were identical to those of Experiment 1, using the same model formula \(Equation [3](https://arxiv.org/html/2606.13993#S3.E3)\), predictor definitions \(Equation [1](https://arxiv.org/html/2606.13993#S3.E1), Equation [2](https://arxiv.org/html/2606.13993#S3.E2)\), priors, and random\-effects structure\. The brms models were fit separately for OLMo\-3 7B and each BabyLM model at the final transformer layer\. Across\-layer GAMs were fit using the same approach as Experiment 1 \(Equation [4](https://arxiv.org/html/2606.13993#S3.E4)\)\.
### 4\.2 Results
Full numerical results are reported in the Appendix \(Table [8](https://arxiv.org/html/2606.13993#A5.T8) for the Bayesian model and Table [9](https://arxiv.org/html/2606.13993#A5.T9) for the GAMs\); they are visualized in Figure [3](https://arxiv.org/html/2606.13993#S4.F3) and Figure [4](https://arxiv.org/html/2606.13993#S4.F4)\. Figure [9](https://arxiv.org/html/2606.13993#A5.F9) \(Appendix\) shows the layer\-by\-layer difference more directly\. Frequency effects replicate Experiment 1 across all four models\. Predictability effects are more variable: the BabyLMs show absent or positive effects, whereas OLMo\-3 7B shows progressively stronger negative effects, consistent with predictability sensitivity increasing with scale \(though in the present study it is unclear whether this is due to the scale of the data, the scale of the model, or both\)\. The by\-layer patterns similarly show a progressively stronger effect of frequency in later layers relative to earlier layers, while the effect of predictability follows a similar pattern for OLMo, but not for the BabyLMs\. For the BabyLMs, the effect of predictability either stays relatively constant across layers or, in the case of the 1\.3B model, becomes negative around layer 5 and then slowly grows positive again\.
Figure 3: Final\-layer brms predicted logit by frequency \(left\) and predictability \(right\) for all models \(UP as subword\)\. Shading indicates 95% CI\.

Figure 4: GAM\-predicted logit as a function of transformer layer \(x\-axis\) and predictor value \(y\-axis\) for each model \(UP as subword\)\. Top row: log\-frequency; bottom row: log\-predictability\. Color encodes the predicted logit score \(random effects excluded\)\.
### 4\.3 Discussion
The results of Experiment 2 mostly replicate those of Experiment 1\. Frequency shows a strong, consistent effect such that the representation of *up* in high\-frequency V\+*up* phrases diverges from representations of *up*, even as a subword unit\. The effect of predictability seems to emerge in larger models, suggesting that sensitivity to predictability may require more representational capacity than sensitivity to frequency\. This is perhaps unsurprising: frequency is directly recoverable from surface co\-occurrence counts, while predictability requires tracking the conditional distribution of the particle given the verb\. Predictability is thus a more relational statistic that may only become reliably encoded at larger scale\.
## 5 Experiment 3: Whisper \(ASR Model\)
In Experiment 3, we apply the same classifier approach to Whisper\-small, an automatic speech recognition \(ASR\) model trained on spoken audio rather than written text\. Whisper differs from the models in Experiments 1 and 2 in both its training modality and its encoder\-decoder architecture\. Examining Whisper allows us to ask whether the frequency and predictability effects on phrasal representations generalize beyond text\-based models to representations learned from speech\.
### 5\.1 Methods
The procedure mirrored Experiments 1 and 2, with two key differences: \(1\) the model was Whisper\-small, an ASR model with an encoder\-decoder architecture trained on spoken audio rather than text; and \(2\) the stimuli were spoken audio segments rather than written sentences\. For each segment, we ran the audio through Whisper and extracted the hidden\-state representation of *up* at each layer of the encoder and decoder separately\. A logistic regression classifier was trained independently for each component \(encoder, decoder\) on spoken instances of standalone *up* \(see Classifier Training below\), then applied to spoken V\+*up* phrasal verb segments\. For the sake of brevity, we only trained the classifier to differentiate between Whisper’s embeddings of *up* as a standalone preposition and non\-*up* tokens\.
#### 5\.1\.1 Classifier Training
The Whisper experiment used spoken language data drawn from a subset of the GigaSpeech corpus \(Chen et al\., [2021](https://arxiv.org/html/2606.13993#bib.bib33)\) containing 224,118 timestamped speech segments in which *up* occurred\. Word\-level timestamps within each segment were obtained using WhisperX forced alignment\. Each occurrence of *up* was classified by running spaCy part\-of\-speech tagging on the full segment transcript: if the token immediately preceding *up* was tagged as a VERB, the instance was labelled *V\+up* \(e\.g\., *pick up*, *clean up*\)\. This classification procedure mirrors the one used to identify standalone *up* in Experiment 1\. Of the 224,118 segments that contained *up*, 165,367 were classified as *V\+up*, spanning 4,161 unique V\+*up* types\.
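The labelling rule above can be sketched without the tagger itself by operating on already tagged tokens. The input format of `(text, pos)` pairs, the function name, and the catch\-all label `other` are hypothetical; in the study the tags come from spaCy, and the source states only the rule for assigning the *V\+up* label.

```python
def label_up_occurrences(tagged_tokens):
    """Label each 'up' as 'V+up' when the immediately preceding token is a VERB."""
    labels = []
    for i, (text, _pos) in enumerate(tagged_tokens):
        if text.lower() != "up":
            continue
        prev_pos = tagged_tokens[i - 1][1] if i > 0 else None
        label = "V+up" if prev_pos == "VERB" else "other"
        labels.append((i, label))
    return labels


# Hypothetical tagged transcript for illustration.
transcript = [
    ("please", "INTJ"), ("pick", "VERB"), ("up", "ADP"), ("the", "DET"),
    ("box", "NOUN"), ("up", "ADV"), ("there", "ADV"),
]
print(label_up_occurrences(transcript))
```

Because the rule checks only the immediately preceding token, a particle separated from its verb (as in *pick it up*) would not receive the *V\+up* label under this sketch.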
The training and validation sets used 1,000 standalone *up* instances each as positive examples, paired with 1,000 randomly selected non\-*up* words from the same segment as negative examples\. This mirrors the standalone\-*up* training design of Experiment 1\. Negative words were restricted to those whose string consists entirely of alphabetic characters \(matching the criteria used in Experiments 1 and 2\), with a minimum duration of 20 ms\. Training and validation segments were non\-overlapping\.
The test set comprised V\+*up* types with at least 5 occurrences in the audio dataset, with up to 20 instances per type\. Corpus frequency and predictability were drawn from Dolma v1\.7 \(via infini\-gram\), identical to OLMo\-3 7B\. After filtering to types with a valid predictability estimate, 1,426 unique V\+*up* types were retained \(median corpus frequency: 243,412; median log\-predictability: −4\.08\)\. The high median frequency reflects the audio coverage threshold: types requiring at least 5 spoken occurrences are predominantly common V\+*up* phrases \(e\.g\., *pick up*, *end up*, *set up*\) that are also extremely frequent in Dolma\. Despite the high median frequency, the test items still spanned a wide range of frequencies and thus we still have the statistical power to detect effects of frequency\. Split sizes and item\-level statistics are included in the Appendix \(Table [10](https://arxiv.org/html/2606.13993#A6.T10) and Table [3](https://arxiv.org/html/2606.13993#A3.T3)\)\.
#### 5\.1\.2 Analyses
Analyses followed the same procedure as Experiments 1 and 2, using the same model formula \(Equation [3](https://arxiv.org/html/2606.13993#S3.E3)\) and predictor definitions \(Equation [1](https://arxiv.org/html/2606.13993#S3.E1), Equation [2](https://arxiv.org/html/2606.13993#S3.E2)\)\. The brms models were fit separately for Whisper’s encoder and decoder at the final hidden layer\. Corpus statistics were drawn from Dolma v1\.7 \(queried via the infini\-gram API\) for both components\. Across\-layer GAMs were fit using the same approach as Experiment 1 \(Equation [4](https://arxiv.org/html/2606.13993#S3.E4)\)\.
### 5\.2 Results
Full numerical results are reported in the Appendix \(Table [11](https://arxiv.org/html/2606.13993#A6.T11) for the Bayesian model and Table [12](https://arxiv.org/html/2606.13993#A6.T12) for the GAM\); they are visualized in Figure [5](https://arxiv.org/html/2606.13993#S5.F5) and Figure [6](https://arxiv.org/html/2606.13993#S5.F6) below\. Figure [10](https://arxiv.org/html/2606.13993#A6.F10) \(Appendix\) shows the layer\-by\-layer difference between high\- and low\-predictor items more directly\.
The results for Whisper show a similar trend across both the encoder and decoder\. Specifically, we see a negative effect for frequency and predictability across both\.
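For readers interpreting the Bayesian estimates, the "% \> 0" column used in the appendix tables is the percentage of posterior samples with a positive estimate, so a clearly negative effect corresponds to a value near zero\. The sketch below shows how such a summary could be computed from posterior draws\. The function names and the toy draws are hypothetical, and the interval is a simple equal\-tailed quantile interval, which may differ from the summary the authors obtained from brms\.

```python
import numpy as np


def percent_positive(draws):
    """Percentage of posterior draws that are strictly greater than zero."""
    draws = np.asarray(draws, dtype=float)
    return 100.0 * float(np.mean(draws > 0))


def summarize(draws, ci=0.95):
    """Posterior mean, equal-tailed credible interval, and % > 0."""
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - ci) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return {
        "mean": float(draws.mean()),
        "lower": float(lower),
        "upper": float(upper),
        "pct_gt_0": percent_positive(draws),
    }


# Toy draws (hypothetical, not the paper's posterior): three of four are negative.
toy = [-2.0, -1.0, -0.5, 0.5]
print(percent_positive(toy))
```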
The by\-layer analysis demonstrates that the negative effects in the encoder vary across layers: predictability stays mostly negative throughout, while frequency starts out negative, rises through the middle layers, then becomes negative again at later layers\. In the decoder, by contrast, the effects of both frequency and predictability decrease steadily across layers\. It is unclear why the effect of frequency in the encoder starts negative and then becomes less negative towards the middle layers\. One possible interpretation is that high\-frequency segments are often reduced \(e\.g\., Bybee and Scheibman, [1999](https://arxiv.org/html/2606.13993#bib.bib25)\): acoustically, these segments would be less similar to standalone *up* than low\-frequency \(unreduced\) forms\. The middle layers may be able to abstract across this reduction, weakening the effect of frequency on the representation\. On this account, the decrease in later layers reflects holistic storage \(as opposed to phonetic reduction\)\. Future research is needed to confirm this\.
Figure 5: Final\-layer brms predicted logit by frequency \(left\) and predictability \(right\) for Whisper encoder and decoder\. Shading indicates 95% CI\.
Figure 6: GAM\-predicted logit by layer and predictor for Whisper encoder and decoder\. Top: frequency; bottom: predictability\. Color = logit; darker = lower logit \(less similar to standalone *up*\); random effects excluded\.
### 5\.3Discussion
The Whisper results demonstrate an interesting similarity between audio speech recognizers and text\-based language models: both the encoder and the decoder of Whisper show patterns similar to those of the LLMs \(high\-predictability and high\-frequency V\+*up* phrases show lower classifier logit scores\)\. This is especially surprising for the encoder, which is traditionally thought to consist mostly of acoustic representations \(though see Pluth et al\., [2026](https://arxiv.org/html/2606.13993#bib.bib34), for evidence that the encoder encodes semantics and other higher\-level linguistic information\)\.
## 6Conclusion
The present study demonstrates that both text\-based LLMs and ASR models represent high\-frequency and high\-predictability V\+*up* phrases holistically\. Notably, this holds even in models trained on an amount of data comparable to what humans receive, suggesting that holistic storage does not require the massive training corpora typical of modern LLMs\. Further, the effect of predictability is clearer for larger models, suggesting that increasing the number of parameters facilitates predictability effects\. Finally, the results suggest that for larger models these effects emerge in earlier layers, while for smaller models they emerge in later layers\. The Whisper results add a further nuance: the effect of holistic storage is similar for both the encoder and decoder, adding to the growing body of literature on the rich representations contained within the encoder \(Pluth et al\., [2026](https://arxiv.org/html/2606.13993#bib.bib34)\)\.
The similarity between the Whisper results and those of the text\-based LLMs is notable for a further reason: in LLMs, the token for *up* is identical across all of our test items, whereas Whisper takes audio as its input, so each instance of *up* is distinct \(the audio representations are slightly different\)\. These differences between the input types could in theory result in different representational behavior between the models\. Interestingly, however, both types of model employ similar strategies\. It is unclear exactly why this is the case, but one possibility is that the encoder abstracts away some of the acoustic variability and represents *up* similarly enough that the model is able to treat these instances as a single token\.
The by\-layer results suggest that holistic storage is better understood as a spectrum than a binary\. Representations that diverge only in later layers may reflect a form of partial holistic storage, where lower\-level representations remain more compositional but higher\-level abstractions increasingly integrate item\-specific information\. Larger models, which have encountered V\+*up*phrases more frequently, show more divergent representations even in early hidden layers, suggesting that with sufficient exposure or capacity, storage\-like behavior becomes a more pervasive organizational principle\.
Holistic storage is itself a form of exemplar\-based representation: rather than computing a phrase from its parts each time, the whole phrase is represented as a stored unit\. If holistic storage and grammatical abstraction arise from the same process, then the traditional distinction between stored items and productive rules may be a post\-hoc description of a continuous representational landscape rather than a reflection of genuinely distinct cognitive processes\. The gradient by\-layer results support this view: rather than a sharp transition from compositional to holistic representations, we observe a progressive divergence that deepens across layers and strengthens with frequency and predictability\. This is more consistent with a view in which storage\-like behavior is one end of a continuum than with a view in which some items are stored and others are computed\.
Overall, the present study demonstrates another dimension along which the behavior of transformer models, across modalities, resembles that of humans, providing further evidence for usage\-based theories\. Specifically, usage\-based accounts predict that holistic storage should be a gradient, frequency\- and predictability\-driven phenomenon \(e\.g\., Bybee and Scheibman, [1999](https://arxiv.org/html/2606.13993#bib.bib25); Kapatsinski et al\., [2009](https://arxiv.org/html/2606.13993#bib.bib38); Houghton, [2025a](https://arxiv.org/html/2606.13993#bib.bib2)\)\. The present results are consistent with this prediction: storage\-like representations emerge gradually as a function of frequency and predictability, and they arise in models that have no explicit storage mechanism and, in the case of the BabyLMs, have seen no more data than a human\. More broadly, the results suggest that transformer architectures may serve as a productive modeling framework for investigating not just whether human\-like storage patterns emerge, but *how* and *when* they do so, and at a level of mechanistic detail that behavioral data alone cannot provide\.
## 7Limitations
The primary limitation is that we examined only one construction \(V\+*up*phrases\) and in one language \(English\)\. We look forward to expanding the present study to other languages and other constructions in future work\. We also only examined representations at the final checkpoint; however, we look forward to examining the emergence of storage as a function of training dynamics in the future\.
## References
- J\. Berko \(1958\) The child’s learning of english morphology\. *WORD* 14\(2\-3\), pp\. 150–177\. External Links: [Document](https://dx.doi.org/10.1080/00437956.1958.11659661), [Link](http://www.tandfonline.com/doi/full/10.1080/00437956.1958.11659661) Cited by: [§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p1.1)\.
- J\. K\. Bock \(1986\)Syntactic persistence in language production\.Cognitive Psychology18\(3\),pp\. 355–387\.External Links:[Document](https://dx.doi.org/10.1016/0010-0285%2886%2990004-6),[Link](https://www.sciencedirect.com/science/article/pii/0010028586900046)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p1.1)\.
- P\. Bürkner \(2017\)Brms: an r package for bayesian multilevel models using stan\.Journal of Statistical Software80\(1\),pp\. 1–28\.External Links:[Document](https://dx.doi.org/10.18637/jss.v080.i01)Cited by:[§3\.1\.2](https://arxiv.org/html/2606.13993#S3.SS1.SSS2.p1.2)\.
- J\. Bybee and J\. Scheibman \(1999\)The effect of usage on degrees of constituency: the reduction of don’t in english\.Linguistics37\(4\)\.External Links:[Document](https://dx.doi.org/10.1515/ling.37.4.575),[Link](https://www.degruyter.com/document/doi/10.1515/ling.37.4.575/html)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.13993#S5.SS2.p3.1),[§6](https://arxiv.org/html/2606.13993#S6.p5.1)\.
- L\. Charpentier, L\. Choshen, R\. Cotterell, M\. O\. Gul, M\. Hu, J\. Jumelet, T\. Linzen, J\. Liu, A\. Mueller, C\. Ross,et al\.\(2025\)Babylm turns 3: call for papers for the 2025 babylm workshop\.arXiv preprint arXiv:2502\.10645\.Cited by:[1st item](https://arxiv.org/html/2606.13993#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.13993#S2.p1.1),[§3\.1](https://arxiv.org/html/2606.13993#S3.SS1.p6.1)\.
- G\. Chen, S\. Chai, G\. Wang, J\. Du, W\. Zhang, C\. Weng, D\. Su, D\. Povey, J\. Trmal, J\. Zhang, M\. Jin, S\. Khudanpur, S\. Watanabe, S\. Zhao, W\. Zou, X\. Li, X\. Yao, Y\. Wang, Y\. Wang, Z\. You, and Z\. Yan \(2021\)GigaSpeech: an evolving, multi\-domain ASR corpus with 10,000 hours of transcribed audio\.InProc\. Interspeech 2021,Cited by:[§5\.1\.1](https://arxiv.org/html/2606.13993#S5.SS1.SSS1.p1.1)\.
- C\. Haley \(2020\)This is a bert\. now there are several of them\. can they generalize to novel words?\.InProceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP,pp\. 333–341\.Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- Z\. Harmon and V\. Kapatsinski \(2017\)Putting old tools to novel uses: the role of form accessibility in semantic extension\.Cognitive Psychology98,pp\. 22–44\.Cited by:[§4](https://arxiv.org/html/2606.13993#S4.p1.1)\.
- Z\. Houghton, M\. Kato, M\. Baese\-Berk, and C\. Vaughn \(2024\)Task\-dependent consequences of disfluency in perception of native and non\-native speech\.Applied psycholinguistics45\(1\),pp\. 64–80\.Cited by:[footnote 5](https://arxiv.org/html/2606.13993#footnote5)\.
- Z\. N\. Houghton and E\. Morgan \(2023\)Does predictability drive the holistic storage of compound nouns?\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.45\.Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- Z\. N\. Houghton and E\. Morgan \(2024\)Frequency\-dependent preference extremity arises from a noisy\-channel processing model\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.46\.Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- Z\. N\. Houghton, K\. Sagae, and E\. Morgan \(2025\)The role of abstract representations and observed preferences in the ordering of binomials in large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 695–702\.Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- Z\. N\. Houghton \(2025a\)Multi\-word representations in minds and models: investigating the storage of multi\-word phrases in humans and large language models\.Ph\.D\. Thesis\.External Links:[Link](https://search.proquest.com/openview/59e91c3252682d11c048247ba152f037/1?pq-origsite=gscholar&cbl=18750&diss=y)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1),[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p3.1),[§1](https://arxiv.org/html/2606.13993#S1.p1.1),[§3\.3](https://arxiv.org/html/2606.13993#S3.SS3.p1.1),[§3](https://arxiv.org/html/2606.13993#S3.p1.1),[§4](https://arxiv.org/html/2606.13993#S4.p1.1),[§6](https://arxiv.org/html/2606.13993#S6.p5.1),[footnote 2](https://arxiv.org/html/2606.13993#footnote2)\.
- Z\. N\. Houghton \(2025b\)The bolts and nuts of language processing: an investigation into the noisy\-channel processing of binomials\.Master’s Thesis\.External Links:[Link](https://search.proquest.com/openview/48f06dad467d78a128dc477700ab1b34/1?pq-origsite=gscholar&cbl=18750&diss=y)Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- V\. Kapatsinski, J\. Radicke, R\. Corrigan, E\. A\. Moravcsik, H\. Ouali, and K\. Wheatley \(2009\)Frequency and the emergence of prefabs\.Formulaic language2,pp\. 499–520\.Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1),[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p3.1),[§1](https://arxiv.org/html/2606.13993#S1.p1.1),[§3\.3](https://arxiv.org/html/2606.13993#S3.SS3.p1.1),[§3](https://arxiv.org/html/2606.13993#S3.p1.1),[§4](https://arxiv.org/html/2606.13993#S4.p1.1),[§6](https://arxiv.org/html/2606.13993#S6.p5.1)\.
- K\. Lasri, O\. Seminck, A\. Lenci, and T\. Poibeau \(2022\)Subject verb agreement error patterns in meaningless sentences: humans vs\. bert\.InProceedings of the 29th International Conference on Computational Linguistics,pp\. 37–43\.Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- R\. Levy, E\. Fedorenko, M\. Breen, and E\. Gibson \(2012\)The processing of extraposed structures in english\.Cognition122\(1\),pp\. 12–36\.Cited by:[footnote 1](https://arxiv.org/html/2606.13993#footnote1)\.
- J\. Liu, S\. Min, L\. Zettlemoyer, Y\. Choi, and H\. Hajishirzi \(2024\)Infini\-gram: scaling unbounded n\-gram language models to a trillion tokens\.arXiv preprint arXiv:2401\.17377\.Cited by:[§3\.1](https://arxiv.org/html/2606.13993#S3.SS1.p6.1)\.
- R\. T\. McCoy, P\. Smolensky, T\. Linzen, J\. Gao, and A\. Celikyilmaz \(2023\)How much do language models copy from their training data? evaluating linguistic novelty in text generation using raven\.Transactions of the Association for Computational Linguistics11,pp\. 652–670\.External Links:[Link](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00567/116616)Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- D\. E\. Meyer and R\. W\. Schvaneveldt \(1971\)Facilitation in recognizing pairs of words: evidence of a dependence between retrieval operations\.\.Journal of experimental psychology90\(2\),pp\. 227\.External Links:[Link](https://psycnet.apa.org/record/1972-04123-001)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p1.1)\.
- K\. Misra and K\. Mahowald \(2024\)Language models learn rare phenomena from less rare phenomena: the case of the missing aanns\.InProceedings of the 2024 conference on empirical methods in natural language processing,pp\. 913–929\.Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- E\. Morgan and R\. Levy \(2015\)Modeling idiosyncratic preferences: how generative knowledge and expression frequency jointly determine language structure\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.37\.Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- E\. Morgan and R\. Levy \(2016\)Abstract knowledge versus direct experience in processing of binomial expressions\.Cognition157,pp\. 384–402\.External Links:[Document](https://dx.doi.org/10.1016/j.cognition.2016.09.011),[Link](http://dx.doi.org/10.1016/j.cognition.2016.09.011)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p1.1),[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1),[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p3.1),[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- E\. Morgan and R\. Levy \(2024\)Productive knowledge and item\-specific knowledge trade off as a function of frequency in multiword expression processing\.Language100\(4\),pp\. e195–e224\.External Links:[Link](https://muse.jhu.edu/pub/24/article/947046)Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- S\. Pinker and M\. T\. Ullman \(2002\)The past and future of the past tense\.Trends in Cognitive Sciences6\(11\),pp\. 456–463\.External Links:[Document](https://dx.doi.org/10.1016/S1364-6613%2802%2901990-3),[Link](https://www.sciencedirect.com/science/article/pii/S1364661302019903)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1)\.
- D\. Pluth, Z\. N\. Houghton, Y\. Zhou, and V\. K\. Gurbani \(2026\)Mechanistic interpretability of asr models using sparse autoencoders\.External Links:2605\.12225,[Link](https://arxiv.org/abs/2605.12225)Cited by:[§5\.3](https://arxiv.org/html/2606.13993#S5.SS3.p1.1),[§6](https://arxiv.org/html/2606.13993#S6.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.13993#S3.SS1.SSS1.p1.1)\.
- J\. P\. Stemberger and B\. MacWhinney \(2004\)Are inflected forms stored in the lexicon\.Morphology: Critical concepts in linguistics6,pp\. 107–122\.External Links:[Link](https://books.google.com/books?hl=en&lr=&id=bGl0aKBld3cC&oi=fnd&pg=PA107&dq=stemberger+2004+inflected&ots=RdvzVaC_NS&sig=0DJV8gUVaoZv_COZqcLXOu5_evU)Cited by:[§1\.1](https://arxiv.org/html/2606.13993#S1.SS1.p2.1),[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- J\. P\. Stemberger and B\. MacWhinney \(1986\)Frequency and the lexical storage of regularly inflected forms\.Memory & Cognition14\(1\),pp\. 17–26\.External Links:[Document](https://dx.doi.org/10.3758/BF03209225),[Link](http://link.springer.com/10.3758/BF03209225)Cited by:[§1](https://arxiv.org/html/2606.13993#S1.p1.1)\.
- S\. N\. Wood \(2017\)Generalized additive models: an introduction with r\.chapman and hall/CRC\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.13993#S3.SS1.SSS2.p4.1)\.
- Q\. Yao, K\. Misra, L\. Weissweiler, and K\. Mahowald \(2025\)Both direct and indirect evidence contribute to dative alternation preferences in language models\.arXiv preprint arXiv:2503\.20850\.Cited by:[§1\.2](https://arxiv.org/html/2606.13993#S1.SS2.p1.1)\.
- S\. Zhang, S\. Roller, N\. Goyal, M\. Artetxe, M\. Chen, S\. Chen, C\. Dewan, M\. Diab, X\. Li, X\. V\. Lin,et al\.\(2022\)Opt: open pre\-trained transformer language models\.arXiv preprint arXiv:2205\.01068\.Cited by:[§2](https://arxiv.org/html/2606.13993#S2.p1.1)\.
## Appendix ABabyLM Model Training
Models were trained from random initialization for 20 epochs using the AdamW optimizer with fused weight updates and bfloat16 mixed precision\. The learning rate was set to $3\times 10^{-4}$ for the 125M model and $1\times 10^{-4}$ for the 350M and 1\.3B models, each preceded by a linear warmup over 10% of total training steps\. Training was distributed across two NVIDIA A100 80GB GPUs\. Hyperparameter details are given in Table [1](https://arxiv.org/html/2606.13993#A1.T1)\.
Table 1: Hyperparameters for the three BabyLM models\. All models share a BPE tokenizer \(vocabulary size 8,192\) trained on the BabyLM corpus\. Training used two NVIDIA A100 80GB GPUs with distributed data parallelism\.
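To make the schedule concrete, the sketch below computes a per\-step learning rate with a linear warmup over the first 10% of steps followed by a decay to zero\. The warmup fraction and peak rates come from the text; the linear shape of the decay and the function name `learning_rate` are assumptions, since the appendix states only that the rate decays to zero by the end of training\.

```python
def learning_rate(step, total_steps, peak_lr, warmup_frac=0.10):
    """Learning rate at a given 0-indexed step.

    Linear warmup to `peak_lr` over the first `warmup_frac` of steps,
    then a linear decay to zero (the decay shape is an assumption).
    """
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return peak_lr
    progress = (step - warmup_steps) / remaining
    return peak_lr * max(0.0, 1.0 - progress)


# Peak rates reported for the 125M model and for the 350M / 1.3B models.
for peak in (3e-4, 1e-4):
    schedule = [learning_rate(s, 1000, peak) for s in (0, 99, 550, 1000)]
    print(peak, schedule)
```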
## Appendix BBabyLM Model Convergence
All three models trained to completion over 20 epochs, with training loss decreasing monotonically throughout \(Figure[7](https://arxiv.org/html/2606.13993#A2.F7)\)\. Final gradient norms were small for all models \(0\.16, 0\.24, and 0\.38 for the 125M, 350M, and 1\.3B models, respectively\), and the learning rate decayed to zero at the end of training, consistent with full convergence under the scheduled warmup\-then\-decay regime\. No signs of loss divergence or instability were observed\.
The 1\.3B model achieves the lowest training loss \(2\.55\) and validation perplexity \(15\.18\), as expected given its greater capacity \(Table [2](https://arxiv.org/html/2606.13993#A2.T2)\)\. The 350M model, however, shows slightly elevated training loss \(2\.63\) and validation perplexity \(18\.86\) relative to the 125M model \(training loss 2\.55; validation perplexity 16\.81\)\. This is likely attributable to the lower learning rate applied to the 350M model \($1\times 10^{-4}$ vs\. $3\times 10^{-4}$ for the 125M\), which, combined with a relatively modest dataset size of 150M tokens, may have produced slower effective convergence\. The effect is small and does not affect our qualitative comparisons, but we note it here for completeness\.
Figure 7: Training loss over 20 epochs for each BabyLM model\. Loss is logged every 10 gradient steps\.
Table 2: Final validation loss and perplexity for each BabyLM model, evaluated on the BabyLM development set after 20 epochs of training\.
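Validation loss and perplexity are two views of the same quantity under the usual convention that perplexity is the exponential of the mean per\-token cross\-entropy in nats\. The sketch below applies that conversion to the perplexities quoted in the text\. The convention itself is an assumption here \(the appendix does not state the log base or the averaging unit\), and the helper names are hypothetical\.

```python
import math


def perplexity_to_loss(perplexity):
    """Mean cross-entropy in nats implied by a perplexity (assumes PPL = exp(loss))."""
    return math.log(perplexity)


def loss_to_perplexity(loss):
    """Perplexity implied by a mean cross-entropy in nats."""
    return math.exp(loss)


# Validation perplexities quoted in the text for each BabyLM model.
reported_perplexity = {"125M": 16.81, "350M": 18.86, "1.3B": 15.18}

for name, ppl in reported_perplexity.items():
    print(f"{name}: perplexity {ppl:.2f} -> implied loss {perplexity_to_loss(ppl):.3f} nats")
```

Under this assumption, the 1\.3B model's validation perplexity of 15\.18 corresponds to a loss of roughly 2\.72 nats\.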
## Appendix CTest Set Items
Table 3: Test set statistics for all experiments\. Only items with a valid \(non\-zero\) predictability estimate are included\. Frequency is the raw corpus count; predictability is the log odds \(see Equation [2](https://arxiv.org/html/2606.13993#S3.E2)\)\. BabyLM models \(125M, 350M, 1\.3B\) share identical items and corpus statistics\. Whisper encoder and decoder evaluate the same audio items with Dolma\-derived frequency counts\.
## Appendix DExperiment 1: UP Independently
Table 4: Classifier training and validation split sizes for Experiment 1 \(UP independently\)\. Positive \(pos\) = standalone *up*; negative \(neg\) = other alphabetic token from the same sentence\.
Table 5: Joint frequency × predictability model at the final layer for UP independently\. Estimates are from Bayesian linear mixed\-effects models \(brms\); % \> 0 indicates the percentage of posterior samples with a positive estimate\.
Table 6: GAM tensor\-product smooth summary for UP independently\. EDF = estimated degrees of freedom; p = p\-value for the te\(predictor, layer\) smooth term\.
Figure 8: Layer\-by\-layer difference in GAM\-predicted logit between high \(90th percentile\) and low \(10th percentile\) predictor values, for UP independently\. Green = frequency effect; yellow = predictability effect\. Inner and outer shaded ribbons show 80% and 95% confidence intervals derived from the GAM standard error\. A negative difference indicates that high\-predictor phrases yield a lower classifier logit \(i\.e\., their representation of *up* is less similar to the standalone class\)\.
## Appendix EExperiment 2: UP as Subword
Table 7: Classifier training and validation split sizes for Experiment 2 \(UP as subword\)\. Positive examples: 1,000 standalone *up* \+ 1,000 unique up\-within\-word types\. Negative examples: 1,000 from each sentence pool\. Val sizes vary by model due to tokenizer coverage\.
Table 8: Joint frequency × predictability model at the final layer for UP as subword\. Estimates are from Bayesian linear mixed\-effects models \(brms\); % \> 0 indicates the percentage of posterior samples with a positive estimate\.
Table 9: GAM tensor\-product smooth summary for UP as subword\. EDF = estimated degrees of freedom; p = p\-value for the te\(predictor, layer\) smooth term\.
Figure 9: Layer\-by\-layer difference in GAM\-predicted logit between high \(90th percentile\) and low \(10th percentile\) predictor values, for UP as subword\. Green = frequency effect; yellow = predictability effect\. Inner and outer shaded ribbons show 80% and 95% confidence intervals derived from the GAM standard error\. A negative difference indicates that high\-predictor phrases yield a lower classifier logit \(i\.e\., their representation of *up* is less similar to the standalone class\)\.
## Appendix FExperiment 3: Whisper
Table 10: Classifier training and validation split sizes for Experiment 3 \(Whisper\)\. Positive examples are *word\_up* instances; negatives are random non\-*up* alphabetic tokens from the same segment\.
Table 11: Joint frequency × predictability model at the final layer for Whisper encoder and decoder\. Estimates are from Bayesian linear mixed\-effects models \(brms\); % \> 0 indicates the percentage of posterior samples with a positive estimate\.
Table 12: GAM tensor\-product smooth summary for Whisper encoder and decoder\. EDF = estimated degrees of freedom; p = p\-value for the te\(predictor, layer\) smooth term\.
Figure 10: Layer\-by\-layer difference in GAM\-predicted logit between high \(90th percentile\) and low \(10th percentile\) predictor values, for Whisper encoder and decoder\. Green = frequency effect; yellow = predictability effect\. Inner and outer shaded ribbons show 80% and 95% confidence intervals derived from the GAM standard error\. A negative difference indicates that high\-predictor phrases yield a lower classifier logit \(i\.e\., their representation of *up* is less similar to the standalone class\)\.