What do Language Models Learn and When? The Implicit Curriculum Hypothesis
Summary
This paper proposes the Implicit Curriculum Hypothesis: language model pretraining follows a structured, compositional curriculum in which capabilities emerge in a consistent order across architectures and can be predicted from internal representations. The authors test this with a suite of simple, composable tasks spanning retrieval, morphology, coreference, logical reasoning, and mathematics, and find highly consistent emergence orderings (ρ = .81 across 45 model pairs) in four model families.
Source: https://huggingface.co/papers/2604.08510
Abstract
Pretraining follows a structured, compositional curriculum where model capabilities emerge consistently across different architectures and can be predicted from internal representations.
Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent (ρ = .81 across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining (R^2 = .68-.84 across models) without previously evaluating them.
Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
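The abstract's central measurement, the consistency of emergence orderings across model pairs, can be sketched as follows. This is a minimal illustration and not the authors' code. The 0.5 default threshold, the task names, the checkpoint steps, and the helper names (`emergence_step`, `spearman_rho`, `mean_pairwise_rho`) are all assumptions made for the example. It takes the emergence point of a task to be the first checkpoint whose accuracy reaches a fixed threshold, then averages Spearman's ρ over every pair of models.

```python
from itertools import combinations
from statistics import mean


def emergence_step(steps, accuracies, threshold=0.5):
    """Return the first checkpoint step whose accuracy reaches the threshold, or None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant (no ordering to compare).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_rho(emergence_by_model):
    """Average Spearman rho over all model pairs, using only tasks that emerged in both."""
    rhos = []
    for a, b in combinations(sorted(emergence_by_model), 2):
        ea, eb = emergence_by_model[a], emergence_by_model[b]
        shared = [t for t in ea if ea[t] is not None and eb.get(t) is not None]
        rhos.append(spearman_rho([ea[t] for t in shared], [eb[t] for t in shared]))
    return mean(rhos)


# Hypothetical emergence steps for three toy tasks in two toy models.
toy_emergence = {
    "model_a": {"copy": 1000, "plural": 2000, "copy_then_plural": 4000},
    "model_b": {"copy": 500, "plural": 3000, "copy_then_plural": 2500},
}
toy_rho = mean_pairwise_rho(toy_emergence)
```

In the toy data the two models agree that `copy` emerges first but disagree on the order of the other two tasks, so the single pairwise ρ is 0.5. With ten models, `combinations` would yield 45 pairs, which matches the pair count quoted in the abstract, although the page itself does not state how many models were compared.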
arXiv page: https://arxiv.org/abs/2604.08510
PDF: https://arxiv.org/pdf/2604.08510
GitHub: https://github.com/KaiserWhoLearns/ElementalTask
Similar Articles
Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs
This paper provides causal evidence that large language models acquire negative linguistic knowledge (what not to say) through statistical preemption, a mechanism from Construction Grammar, by showing that manipulating competing-form frequencies via fine-tuning shifts preemption behavior in predicted directions.
Language Acquisition Device in Large Language Models
This paper proposes LAD-inspired pre-pretraining using a formal language called MP-Struct that encodes natural-language-like structures. It shows that this approach improves token efficiency and imparts human-like resistance to structurally implausible languages, challenging prior hypotheses about effective pre-pretraining languages.
Model Collapse as Cultural Evolution
This paper reframes model collapse in LLMs as a cultural transmission phenomenon, showing that iterated learning theory predicts a non-monotonic trajectory of compositionality under self-training, confirmed across multiple languages and models.
Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
This paper introduces PyLang, a programming language absent from all pretraining corpora, and shows that LLMs fine-tuned on it can learn syntax but fail to transfer algorithmic reasoning, resulting in an 'implementation fidelity gap' where models understand algorithms but cannot express them in an unfamiliar language.
CS336: Language Modeling from Scratch
Stanford is offering a comprehensive course, CS336, where students build a language model from scratch, covering data collection, transformer construction, training, and evaluation.