LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
Summary
This paper introduces LEVANTE-bench, a benchmark that systematically evaluates vision-language models on six cognitive tasks and compares their performance to children aged 5-12, finding that current VLMs align only partially with children's cognitive abilities.
# Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, “Is Your VLM Smarter Than a 5th Grader?”)
Source: [https://arxiv.org/html/2606.05497](https://arxiv.org/html/2606.05497)
Alvin Wei Ming Tan, David Cardinal, Tania Lorido-Botrán, Laura Bravo-Sánchez, Sunny Yu, Michael C. Frank. Stanford University. {tanawm, david81, botran, lmbravo, syu03, mcfrank}@stanford.edu
###### Abstract
Given the inherently multimodal nature of human experience, vision–language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for comparing VLMs with human cognitive development across tasks, ages, and populations. We present LEVANTE-bench, a benchmark based on tasks and data from the Learning Variability Network (LEVANTE), which distributes open-source tasks and data measuring children’s cognition across languages and cultures. In LEVANTE-bench, we systematically assess VLMs on six tasks, comparing their alignment with children aged 5–12 ($N = 1547$) across three countries. We compare models at multiple scales, assessing their overall accuracy, their task- and item-level alignment with children, and how well they match children’s trial-level error distributions. Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans. However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children’s errors better. In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks. Thus, current VLM architectures align only partially with the cognitive abilities of children.
## 1 Introduction
Artificial intelligence models hold substantial promise as scientific tools for understanding human learning and cognition [[40](https://arxiv.org/html/2606.05497#bib.bib16), [36](https://arxiv.org/html/2606.05497#bib.bib15)]. Although the rise of language models has led to an explosion of cognitive evaluation work [[54](https://arxiv.org/html/2606.05497#bib.bib17), [3](https://arxiv.org/html/2606.05497#bib.bib9), [14](https://arxiv.org/html/2606.05497#bib.bib14)], the unimodal nature of such models is an important limitation in using them to study and understand human cognition [[32](https://arxiv.org/html/2606.05497#bib.bib315)]. In particular, for researchers interested in the efficiency and robustness of human learning, it is striking that language models are trained on much *more* language than humans and much *less* data of any other type [[16](https://arxiv.org/html/2606.05497#bib.bib192), [53](https://arxiv.org/html/2606.05497#bib.bib85)]. Visual experience is an especially rich form of data that enables learners to acquire information about the causal structure of the world [[2](https://arxiv.org/html/2606.05497#bib.bib157), [22](https://arxiv.org/html/2606.05497#bib.bib104), [1](https://arxiv.org/html/2606.05497#bib.bib289)].

Figure 1: (A) Example items from the six tasks of LEVANTE-bench as presented to human participants; adapted from [[24](https://arxiv.org/html/2606.05497#bib.bib19)]. (B) Overall accuracies of models (colored by family) and humans (red).

Vision–language models (VLMs) [[34](https://arxiv.org/html/2606.05497#bib.bib187), [28](https://arxiv.org/html/2606.05497#bib.bib162)] thus offer important opportunities for cognitive modeling, especially for understanding human development. First, VLMs can be compared to human cognitive abilities [[37](https://arxiv.org/html/2606.05497#bib.bib21)] and can even be pre-trained on human visual experiences [[49](https://arxiv.org/html/2606.05497#bib.bib184), [48](https://arxiv.org/html/2606.05497#bib.bib219), [50](https://arxiv.org/html/2606.05497#bib.bib290), [51](https://arxiv.org/html/2606.05497#bib.bib291)]. In addition, VLMs can be evaluated using the multimodal format that is most common in experiments with children [[12](https://arxiv.org/html/2606.05497#bib.bib293), [15](https://arxiv.org/html/2606.05497#bib.bib292)]. These two observations mean that VLMs can potentially be used to simulate important aspects of human development, allowing inferences about what behaviors arise from powerful statistical learning mechanisms and rich experiences. The promise of such models is that they can formalize and implement scientific theories from cognitive science, such as helping us better understand which aspects of cognitive development are innately specified [[49](https://arxiv.org/html/2606.05497#bib.bib184), [10](https://arxiv.org/html/2606.05497#bib.bib294), [41](https://arxiv.org/html/2606.05497#bib.bib296)].
For VLMs to be used as cognitive models of learning, they must be evaluated in parallel with human behavioral performance – ideally, on learning trajectories from children\. Many such efforts leverage experiments on visual cognition in adults\[[37](https://arxiv.org/html/2606.05497#bib.bib21),[6](https://arxiv.org/html/2606.05497#bib.bib298),[42](https://arxiv.org/html/2606.05497#bib.bib301)\]\. However, some recent developmentally\-inspired VLM benchmarks have assessed models on concepts and domains typically studied in children\[[50](https://arxiv.org/html/2606.05497#bib.bib290),[27](https://arxiv.org/html/2606.05497#bib.bib297),[7](https://arxiv.org/html/2606.05497#bib.bib299)\], and some have compared VLMs directly to data from children\[[50](https://arxiv.org/html/2606.05497#bib.bib290),[44](https://arxiv.org/html/2606.05497#bib.bib220),[60](https://arxiv.org/html/2606.05497#bib.bib300)\]\.
However, comparisons are limited by the challenges of collecting child data\. Most datasets are English\-only, decreasing the generalizability of cognitive comparisons\[[4](https://arxiv.org/html/2606.05497#bib.bib302)\]\. Few benchmarks compare humans and models at the item level; instead, most only assess overall accuracy\[[45](https://arxiv.org/html/2606.05497#bib.bib256)\]\. In addition, none of these benchmarks use tasks that have been psychometrically assessed for reliability and validity\[[47](https://arxiv.org/html/2606.05497#bib.bib3)\]\. Finally, very few include data from multiple tasks on the same sample of children\. This last point is especially important given that making cross\-task comparisons may be relatively meaningless if the tasks are calibrated on different samples\. The current paper aims to fill these gaps\.
We take advantage of a new resource: the Learning Variability Network Exchange \(LEVANTE\)\[[13](https://arxiv.org/html/2606.05497#bib.bib20)\]\. LEVANTE provides a set of tasks for measuring children’s learning and development, a task administration framework for researchers, and a global dataset of observations gathered from these tasks\. The LEVANTE core tasks, in particular, are a set of psychometrically validated tasks that can be used to study learning development in children aged 5–12, spanning math, executive function, reading, language, spatial cognition, social cognition, and reasoning\[[24](https://arxiv.org/html/2606.05497#bib.bib19)\]\. These tasks are hosted on a central platform, so that as researchers use them for data collection, their data flows into a centralized repository for open distribution\. All LEVANTE data and task assets are licensed for non\-commercial usage, meaning that they can be reused for VLM evaluation\. Further, all tasks are typically given to the same set of children, meaning that task difficulty between children can be fairly compared\.
In this work, we construct an evaluation benchmark \(LEVANTE\-bench\) allowing systematic comparison of VLM performance with human cognitive development across tasks spanning math, reasoning, language, and social cognition \(Figure[1](https://arxiv.org/html/2606.05497#S1.F1)A\) and across three languages \(English, Spanish, and German\), using data from more than 1500 children\. Tasks vary in difficulty: the easiest can be solved by small\-scale open models, while the hardest are still challenging for current commercial frontier models\. Our first contribution is thus to provide the largest and most comprehensive dataset to date for comparing VLMs to children’s cognition\.
Our second contribution is to provide a framework for multi-scale comparison of models and humans. In particular, we measure model–human alignment across three scales: alignment on task difficulties, alignment on item difficulties within tasks, and alignment on trial-level error distributions.[^1] The results of this analysis suggest that alignment varies across scales: while larger models broadly align on task difficulty, all models are at best modestly aligned on item difficulty and are heterogeneous in their trial-level alignment. Together these results highlight gaps in VLM cognitive alignment.

[^1]: We use the term “alignment” to denote a high level of correspondence between human and model on a particular metric, rather than to denote correspondence specifically on goals and values (as the term is used in discussions of AI safety).
## 2 Prior work
A vast literature compares text-only language models with human cognition; overall, human–model behavioral alignment is striking [[54](https://arxiv.org/html/2606.05497#bib.bib17), [58](https://arxiv.org/html/2606.05497#bib.bib304), [19](https://arxiv.org/html/2606.05497#bib.bib305)] and improves with fine-tuning [[3](https://arxiv.org/html/2606.05497#bib.bib9)], though there are certainly still areas of lower alignment [[21](https://arxiv.org/html/2606.05497#bib.bib303), [57](https://arxiv.org/html/2606.05497#bib.bib312), [31](https://arxiv.org/html/2606.05497#bib.bib13)]. A smaller but still extensive set of benchmarks and studies explicitly compare learning trajectories to human development [[38](https://arxiv.org/html/2606.05497#bib.bib11), [59](https://arxiv.org/html/2606.05497#bib.bib10)]; many of these contain linguistic reframings of tasks that are typically given to children in multimodal formats. Theory of mind evaluation is one example: children’s understanding of others’ beliefs is typically assessed using picturebook tasks [[56](https://arxiv.org/html/2606.05497#bib.bib306)], but the vast majority of LLM theory of mind evaluations are in text-only formats [[21](https://arxiv.org/html/2606.05497#bib.bib303), [25](https://arxiv.org/html/2606.05497#bib.bib307)].
Our goal here is to quantify the broad developmental alignment between VLMs and children in multiple domains of cognition. Most relevant to our current work are developmental comparisons of VLMs to human data and phenomena. Several of these are concentrated in specific domains, such as visual cognition [[27](https://arxiv.org/html/2606.05497#bib.bib297), [7](https://arxiv.org/html/2606.05497#bib.bib299), [55](https://arxiv.org/html/2606.05497#bib.bib310), [39](https://arxiv.org/html/2606.05497#bib.bib308)], word learning [[23](https://arxiv.org/html/2606.05497#bib.bib309)], language [[45](https://arxiv.org/html/2606.05497#bib.bib256)], and relational reasoning [[60](https://arxiv.org/html/2606.05497#bib.bib300)]. Most of these do not seek broad coverage over multiple domains, with the exception of Wang et al. [[50](https://arxiv.org/html/2606.05497#bib.bib290)], who compare VLM abilities to tasks described by the NIH Baby Toolbox, a broad coverage tool for measuring cognition in early childhood [[18](https://arxiv.org/html/2606.05497#bib.bib311)]; however, data and stimulus accessibility issues limit their ability to directly compare children and models. Perhaps most closely related, Tan et al. [[45](https://arxiv.org/html/2606.05497#bib.bib256)] proposed a method for comparing children’s response distributions to those of models and applied it across a range of language tasks. Unlike LEVANTE-bench, these works lack human data spanning ages and tasks, limiting the strength of cross-task comparisons that can be performed.
## 3 Human tasks and data
LEVANTE includes an open collection of measures of child learning and development designed to be useful in multiple countries, cultures, and languages [[13](https://arxiv.org/html/2606.05497#bib.bib20)]. All tasks were designed for children aged 5–12 years, with a planned downward extension to ages 2–5. Currently, the tasks are available in English, Spanish, and German. All data collected through the LEVANTE framework are released for open reuse on Redivis.[^2]

[^2]: See [http://researcher.levante-network.org](http://researcher.levante-network.org/) for more details about the project, tasks, and data. Note that use of LEVANTE data requires affirmation of a data use agreement.

We selected six tasks from the LEVANTE core task battery [[24](https://arxiv.org/html/2606.05497#bib.bib19)] based on their suitability for processing by VLMs and their use of a simple multiple-choice format (see Figure [1](https://arxiv.org/html/2606.05497#S1.F1)A):
- Math. Each item is a math question in a simple 4-alternative forced choice (4-AFC) format, covering number identification, comparison, arithmetic, and fraction problems. Number-line items were excluded because they did not use a multiple-choice format.
- Matrix reasoning. Each item is a 3×3 matrix of images that form a pattern, with the lower right element omitted. The participant must deduce which of four options fits the pattern (4-AFC).
- Theory of mind. Each item is a story with a 2-, 3-, or 4-AFC question tapping reasoning about the beliefs and emotions of the individuals in the stories.
- Mental rotation. Each item is a shape (either a 2D silhouette or a 3D shape) that must be rotated to match one of two targets (one rotated and one mirrored; 2-AFC).
- Sentence understanding. Each item is a sentence that must be matched with one of four pictures (4-AFC).
- Vocabulary. Each item is a word that must be matched with one of four pictures (4-AFC).
We use human response data from the LEVANTE 2026.1 data release, which contains data collected from 5–12-year-old children in Colombia ($N = 1020$), Canada ($N = 188$), and Germany ($N = 339$), for a total of 309,108 trial-level responses across all tasks.[^3] Data in Colombia were collected in schools; data in Canada were collected in a laboratory setting; and data in Germany were collected remotely at the participants’ homes. See Kachergis et al. [[24](https://arxiv.org/html/2606.05497#bib.bib19)] for more details on data collection.

[^3]: More participants were recruited in Colombia because not all participants there completed all tasks, so additional participants were needed to reach the target participant count for each task.
## 4 Human–Model Comparison
LEVANTE tasks contain items with a wide range of difficulty suitable for both younger and older children. To avoid boredom and frustration, not all participants saw all items; instead, after an initial period of normative data collection, tasks were made adaptive so that items were selected online based on their estimated difficulty for each task taker. Individual children’s scores are thus not based on their proportion of correct answers across all items, but are instead derived from an item-response theory (IRT) model that assigns an ability score $\theta$, determined by fitting a model of the form

$$P(r_{i,j} = 1 \mid \theta_i, \delta_j) = \frac{e^{\theta_i - \delta_j}}{1 + e^{\theta_i - \delta_j}},$$

where $r_{i,j}$ is the response for participant $i$ and item $j$, $\theta_i$ is the participant’s ability, and $\delta_j$ is the item’s difficulty [[11](https://arxiv.org/html/2606.05497#bib.bib313), [46](https://arxiv.org/html/2606.05497#bib.bib2)].[^4]

[^4]: In practice, scoring models for the LEVANTE core tasks use either the Rasch model (as described in the text) or, for the mental rotation task, a 2-parameter logistic (2PL) model. All models also include a per-item guessing lower bound, which indicates the chance level for each item based on how many options were available. All models except those for the vocabulary task are multi-group IRT models selected based on model comparison [[5](https://arxiv.org/html/2606.05497#bib.bib314)]; see [[24](https://arxiv.org/html/2606.05497#bib.bib19)] for details.
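As a minimal sketch of the scoring model described above, the Rasch response probability can be computed as follows. The source states only that a per-item chance-level lower bound was included; the specific lower-asymptote form used here, `chance + (1 - chance) * p`, is an assumption, as are the function names.

```python
import math


def rasch_prob(theta: float, delta: float) -> float:
    """P(correct) under the Rasch model: logistic(theta - delta)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def rasch_prob_with_guessing(theta: float, delta: float, n_options: int) -> float:
    """Rasch probability with a floor at chance (1 / n_options).

    The lower-asymptote parameterization is an assumption for illustration;
    the source does not give the exact form.
    """
    chance = 1.0 / n_options
    return chance + (1.0 - chance) * rasch_prob(theta, delta)
```

Under this sketch, a participant whose ability equals the item difficulty answers a 4-AFC item correctly with probability 0.25 + 0.75 × 0.5 = 0.625.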
#### Cross\-task alignment\.
To estimate task difficulty, we used the $\theta$ and $\delta$ values to estimate human performance on all trials for all participants, even trials that they did not see. We then estimated the mean accuracy across all trials for each task, and correlated model and human task accuracies.
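The procedure above might be sketched as follows: model-implied accuracy is averaged over every participant-by-item pair, and the resulting task accuracies are correlated with a model's task accuracies. The source does not name the correlation coefficient, so Pearson's r is an assumption here, and the guessing floor is omitted for brevity.

```python
import math


def expected_task_accuracy(thetas, deltas):
    """Mean Rasch-implied accuracy over every participant x item pair,
    including pairs that were never actually administered."""
    total = 0.0
    for theta in thetas:
        for delta in deltas:
            total += 1.0 / (1.0 + math.exp(-(theta - delta)))
    return total / (len(thetas) * len(deltas))


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(human_task_acc, model_task_acc)` on two six-element lists (one entry per task, in the same task order) would then give one cross-task alignment score per model.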
#### Within\-task alignment\.
We compared human item difficulty $\delta$ to model accuracy, calculated as the proportion of runs in which models correctly responded to a given trial. For ease of interpretation, we negated item difficulty values to obtain item easiness values, such that greater model–human alignment would be indicated by higher correlations.
#### Trial\-level error alignment\.
We also make use of raw LEVANTE trial data to recover empirical response distributions for each item over the available answer options. These data allow us to evaluate models based not just on their accuracy, but also on the similarity of their answer distributions to human response patterns. In particular, we use a method similar to Tan et al. [[45](https://arxiv.org/html/2606.05497#bib.bib256)]: for each item, we computed the Kullback–Leibler divergence $D_{\mathrm{KL}}(d \parallel m)$ between the human response distribution $d$, which we treat as the ground-truth distribution, and the estimated multinomial response distribution for a model $m$ (empirically estimated from multiple runs). We added $\epsilon = 10^{-12}$ to the proportion of each option to smooth the distributions. Because higher-ability children were more likely to see more difficult trials (e.g., older children are the only ones who see fraction trials in the math task), human response distributions are not comparable for items of different difficulty. To avoid this issue, we stratify response distributions by participant ability ($\theta$) and compare the model to the response distributions of participants at the same general ability level.
## 5 Experimental setup
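A minimal sketch of the per-item divergence computation is given below. The value of ε follows the text; renormalizing after smoothing is an assumption (the source does not say whether this is done), and the function names are illustrative.

```python
import math
from collections import Counter

EPSILON = 1e-12  # smoothing constant stated in the text


def smoothed_distribution(responses, options):
    """Empirical proportion of each option, plus epsilon, renormalized.

    Renormalization is an assumption; with epsilon this small it has a
    negligible effect on the result.
    """
    counts = Counter(responses)
    n = len(responses)
    raw = [counts[option] / n + EPSILON for option in options]
    total = sum(raw)
    return [p / total for p in raw]


def kl_divergence(d, m):
    """D_KL(d || m), with the human distribution d as the reference."""
    return sum(p * math.log(p / q) for p, q in zip(d, m))
```

Because the human distribution is the reference, a model that never produces an error that children do make is penalized heavily: the smoothed model probability for that option is close to ε, so the corresponding log ratio is large.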
#### Models\.
Our goal is to be able to compare models with humans within model families and across model scales. We thus selected a variety of open-weight VLM families that span a range of parameter sizes: Gemma 4 (E2B, E4B, 26B, 31B); InternVL 3.5 (1B, 2B, 4B, 8B, 14B, 38B); Qwen 3.5 (0.8B, 2B, 4B, 9B, 27B); SmolVLM 2 (256M, 500M, 2.2B); and TinyLLaVA (3.1B). Models were run on a combination of compute resources depending on availability, including an NVIDIA GeForce RTX 3090, several NVIDIA A40s, and an NVIDIA DGX H100 supercomputing cluster, or via the HuggingFace API. We also ran three closed-weight commercial frontier models via API to provide a topline comparison: GPT 5.3, Gemini 2.5 Pro, and Gemini 3 Flash.[^5]

[^5]: For closed-weight models, we used estimates of their number of parameters from [[26](https://arxiv.org/html/2606.05497#bib.bib317), [43](https://arxiv.org/html/2606.05497#bib.bib318)]. We acknowledge that these estimates are preliminary and noisy, but they likely do not make a significant difference to our results, because the closed-weight models are in any case at least an order of magnitude larger than the open-weight models we evaluated.
#### Evaluation configuration\.
To estimate error distributions and minimize response bias effects [[9](https://arxiv.org/html/2606.05497#bib.bib319)], we evaluated each model a minimum of 10 times on each item, permuting response options between runs. We set max tokens to 1024 for open-weight models and 2048 for commercial models. We used thinking mode and FlashAttention-2 for models that support them, and 16-bit floating point precision for consistency. Images were resized to 512×512 pixels. All experiment code as well as downloaders for LEVANTE assets and data can be found on [GitHub](https://anonymous.4open.science/r/levante-bench-3013/).
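The repeated-run protocol might be sketched as below: each run shuffles the response options and relabels them with capital letters, so that a positional bias toward, say, option A does not systematically favor one answer. The use of a seeded uniform shuffle and the function name `permuted_runs` are assumptions; the released code may permute options differently.

```python
import random


def permuted_runs(options, correct, n_runs=10, seed=0):
    """Return one (labelled_options, correct_label) pair per run.

    labelled_options maps capital letters to option texts in a freshly
    shuffled order; correct_label is the letter holding the correct option.
    """
    rng = random.Random(seed)
    labels = "ABCD"
    runs = []
    for _ in range(n_runs):
        order = list(options)
        rng.shuffle(order)
        labelled = dict(zip(labels, order))
        correct_label = next(
            label for label, option in labelled.items() if option == correct
        )
        runs.append((labelled, correct_label))
    return runs
```

Model accuracy on an item is then the proportion of runs in which the returned letter equals that run's `correct_label`, and the per-option response counts are tallied over option texts rather than letters.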
#### Question formatting\.
Item prompts for the LEVANTE tasks were adjusted from the original tasks to account for differences in presentation format \(e\.g\., non\-sequential presentation\)\. Prompt text \(and the prompt image\(s\), if relevant\) was passed to the model along with the possible response options\. Response options were randomized and each option was labeled with a single capital letter \(A through D\)\. Most models were then instructed to provide a response in JSON format with the keys “answer” \(a single capital letter\) and “reason” \(open\-ended text\); these responses were parsed with a parser that incorporated light error correction\. Some of the smaller models \(SmolVLM 2\) were unable to accurately return JSON; instead, we instructed those models to return a single capital letter\.
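To illustrate what a parser with light error correction could look like, the sketch below tries strict JSON first, then falls back to a regular expression for truncated or malformed JSON, and finally accepts a bare letter (as in the single-letter format described above). This is a hypothetical parser written for illustration, not the authors' implementation; the fallback rules are assumptions.

```python
import json
import re

VALID_ANSWERS = {"A", "B", "C", "D"}


def parse_answer(raw: str):
    """Return the answer letter (A-D), or None if nothing usable is found."""
    # 1. Strict path: parse the outermost {...} span as JSON.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group(0))
            answer = str(obj.get("answer", "")).strip().upper()
            if answer in VALID_ANSWERS:
                return answer
        except json.JSONDecodeError:
            pass
    # 2. Light correction: find an answer field in malformed or truncated JSON.
    match = re.search(r'"?answer"?\s*[:=]\s*"?([A-Da-d])\b', raw)
    if match:
        return match.group(1).upper()
    # 3. Bare single-letter response.
    stripped = raw.strip()
    if re.fullmatch(r"[A-Da-d]", stripped):
        return stripped.upper()
    return None
```

Responses for which the function returns `None` would count as parse failures, which is one way to obtain the parse rates mentioned in the prompt sensitivity study.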
#### Prompt sensitivity study\.
Prior to the main evaluation, we conducted an extensive task prompt sensitivity study spanning 5 model families (Qwen, InternVL, Gemma, SmolVLM 2, SpaceThinker) and all six tasks (Appendix [A](https://arxiv.org/html/2606.05497#A1)). We tested a wide range of strategies – structured layouts, chain-of-thought (CoT), few-shot exemplars, self-consistency, task-specific expert framing, and elimination instructions – and found three consistent patterns. First, the optimal prompt varied by model family and scale: enriched prompts that helped larger models (≥4B) often hurt smaller ones. Second, CoT and other reasoning elicitation strategies frequently and substantially *decreased* accuracy for sub-4B models, while also lowering parse rates. Third, a minimal prompt with a JSON output format achieved competitive worst-case accuracy across all model-by-task cells. Based on these findings, we adopted a single default task prompt for all main experiments to ensure fair comparison.
Figure 2: VLM accuracies plotted by $\log_{10}$ parameters (estimated for commercial models). Error bars indicate bootstrapped 95% confidence intervals. Colors denote model families. Dotted lines indicate chance levels, which vary between tasks due to varying numbers of response options. Dashed lines show best fitting logistic regressions.
## 6 Results
We first examine overall model accuracies on LEVANTE\-bench in English \(§[6\.1](https://arxiv.org/html/2606.05497#S6.SS1)\)\. Next, we compare cross\-task \(§[6\.2](https://arxiv.org/html/2606.05497#S6.SS2)\) and within\-task \(§[6\.3](https://arxiv.org/html/2606.05497#S6.SS3)\) alignment in accuracies as well as alignment in response distributions \(§[6\.4](https://arxiv.org/html/2606.05497#S6.SS4)\)\. Finally, we report cross\-linguistic results for German and Spanish \(§[6\.5](https://arxiv.org/html/2606.05497#S6.SS5)\)\.
### 6.1 Model accuracy
Figure [1](https://arxiv.org/html/2606.05497#S1.F1)B shows the overall performance of models and humans across all six tasks in LEVANTE-bench. Figure [2](https://arxiv.org/html/2606.05497#S5.F2) shows model performance on each task as a function of model class and the number of parameters of the model. Recapitulating prior work showing systematic relationships between model size and performance [[35](https://arxiv.org/html/2606.05497#bib.bib316)], we found that larger models were more accurate across tasks, with performance following an approximately logistic form; this trend was also broadly observed within model classes. Models performed relatively well in language tasks (vocabulary and sentence understanding) and in the math task, but poorly in spatial and relational reasoning tasks (mental rotation and matrix reasoning). In particular, all models were approximately at chance for mental rotation, and even commercial frontier models did not reach high levels of performance (cf. [[42](https://arxiv.org/html/2606.05497#bib.bib301)]).
Figure 3: Correlation between model and human task accuracies plotted by (A) $\log_{10}$ parameters and (B) overall accuracy. Error bars indicate bootstrapped 95% confidence intervals. Dotted line indicates zero. Dashed black lines show best fitting sigmoid regressions. Dashed red line and shaded region indicate human split-half correlation and bootstrapped 95% confidence interval.
### 6.2 Cross-task alignment
Across our six tasks, humans and models were positively correlated with respect to task difficulty\. In fact, larger and better\-performing models tended to be more correlated with humans \(Figure[3](https://arxiv.org/html/2606.05497#S6.F3)\), although even commercial models remained significantly below human split\-half correlations, estimated by calculating the task accuracy correlation between splits for 1000 bootstrapped random splits of human participants\. This gap appears to be driven by models’ high performance on vocabulary and very low performance on mental rotation\.
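The human split-half ceiling described above might be estimated as in the following sketch: participants are repeatedly split at random into two halves, mean task accuracy is computed within each half, and the two vectors of task accuracies are correlated. For simplicity the sketch assumes every participant has an accuracy for every task, which is not true of the raw data (the paper works with IRT-implied accuracies); Pearson's r and all names are likewise assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlations(acc_by_participant, n_splits=1000, seed=0):
    """acc_by_participant: one list of task accuracies per participant,
    with tasks in the same order for everyone. Returns one correlation
    per random split."""
    rng = random.Random(seed)
    n_tasks = len(acc_by_participant[0])
    correlations = []
    for _ in range(n_splits):
        shuffled = list(acc_by_participant)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        mean_first = [sum(p[t] for p in first) / len(first) for t in range(n_tasks)]
        mean_second = [sum(p[t] for p in second) / len(second) for t in range(n_tasks)]
        correlations.append(pearson(mean_first, mean_second))
    return correlations
```

The mean of the returned list gives a point estimate, and its 2.5th and 97.5th percentiles give an interval of the kind shown as the shaded band in Figure 3.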
### 6.3 Within-task alignment
Next, we examined model–human alignment on the relative item difficulties within a task. Correlations were mostly positive but small across our models (Figure [4](https://arxiv.org/html/2606.05497#S6.F4)); larger models showed slightly higher correlations with humans. Additional analysis on item subtypes (Appendix [B.2](https://arxiv.org/html/2606.05497#A2.SS2)) suggests that models and humans diverge on what kinds of trials are easy. For example, humans were relatively better than models at addition but worse at fractions, and found 2D mental rotation easier, while models found 3D easier. These differences resulted in relatively modest within-task alignment overall.
Figure 4: Correlation between model accuracy and item easiness estimated from human performance. Error bars indicate bootstrapped 95% confidence intervals. Dotted lines indicate zero. Dashed lines show best fitting linear regressions. Correlations could not be estimated for the math task for GPT 5.3 and Gemini 2.5 Pro, as they had 100% accuracy in retained items and thus no variance across items.
### 6.4 Trial-level error alignment
Next, we investigated the alignment of response distributions between models and humans at the trial level, providing a finer-grained comparison of whether their error patterns are similar. We binned participants by ability (with $\theta$ bin size 1), and calculated the response distribution within each bin, excluding trials that were administered fewer than 10 times within a bin. We then calculated the Kullback–Leibler divergence ($D_{\mathrm{KL}}$) between model response patterns and human response patterns.
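The binning step can be sketched as follows. The bin width of 1 and the minimum of 10 administrations come from the text; aligning bin edges to integers via `floor` and the input layout (one tuple per trial) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

MIN_TRIALS = 10  # cells with fewer administrations are excluded


def binned_response_distributions(trials, bin_width=1.0, min_trials=MIN_TRIALS):
    """trials: iterable of (theta, item_id, chosen_option) tuples.

    Returns {(bin_index, item_id): {option: proportion}} for every
    bin-by-item cell that was administered at least min_trials times.
    """
    counts = defaultdict(Counter)
    for theta, item_id, choice in trials:
        bin_index = math.floor(theta / bin_width)  # e.g. theta in [0, 1) -> bin 0
        counts[(bin_index, item_id)][choice] += 1
    distributions = {}
    for key, counter in counts.items():
        n = sum(counter.values())
        if n < min_trials:
            continue
        distributions[key] = {option: c / n for option, c in counter.items()}
    return distributions
```

Each retained cell is then compared against the model's response distribution for the same item, giving one divergence value per model, item, and ability bin.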
Patterns of model–human divergence were highly heterogeneous between tasks (Figure [5](https://arxiv.org/html/2606.05497#S6.F5)), falling into three patterns. First, math, sentence understanding, and theory of mind (not shown) showed a size effect: larger models better matched higher-ability humans, while the smallest models matched lower-ability (low $\theta$) humans (who were often younger). This pattern suggests that smaller models, while less accurate, can still match lower-ability children’s errors (replicating findings in [[45](https://arxiv.org/html/2606.05497#bib.bib256)]). Second, for vocabulary and matrix reasoning, error pattern alignment was more model-specific, and $D_{\mathrm{KL}}$ tended to be similar across ability bins (with a few notable exceptions, including the largest models for matrix reasoning). Third, mental rotation (not shown) had relatively low and constant $D_{\mathrm{KL}}$ values, perhaps due to overall poor differentiation across models on this task. See Appendix [B.1](https://arxiv.org/html/2606.05497#A2.SS1) for full $D_{\mathrm{KL}}$ results.
Figure 5: Tasks showed markedly different patterns of trial-level alignment. $D_{\mathrm{KL}}$ between model and human response distributions plotted by (A) $\log_{10}$ number of parameters (for math and sentence understanding) and (B) human ability bins (for vocabulary and matrix reasoning). Lower values indicate greater model–human alignment.
### 6.5 Cross-linguistic results
Figure 6: Comparison of model accuracies on English versus German/Spanish versions of our tasks.

We additionally tested a subset of models using LEVANTE German and Spanish item translations.[^6] We used the professionally translated item prompts from the LEVANTE framework, and machine translation for the general task prompts. Models show relatively high correlation between English and German/Spanish accuracies across tasks (Figure [6](https://arxiv.org/html/2606.05497#S6.F6)). However, many models showed lower vocabulary scores and some model sizes showed lower performance on the math task. Overall, these results nonetheless demonstrate models’ robustness across languages and suggest an opportunity for investigating cognitively plausible multilingual training regimes.[^7]

[^6]: Some models were run only once in German and Spanish, rather than 10 times as in English.

[^7]: By design, the LEVANTE framework does not support human cross-site comparisons of cognitive ability due to many differences in sampling and administration across sites [[13](https://arxiv.org/html/2606.05497#bib.bib20)]. However, models which claim to be multilingual should have equitable access and performance across languages, permitting direct comparison on LEVANTE-bench.
## 7 Discussion
Developing VLMs as models of human learning will require benchmarks to assess their alignment with children’s behavior\. These benchmarks must span across cognitive abilities and ages, and should ideally assess children across linguistic and cultural groups\. LEVANTE\-bench takes a step towards accomplishing these goals by leveraging a large, cross\-national collaborative open science project to use validated tasks and a large dataset for human–model comparison\. Critically, rather than examining only absolute task performance, LEVANTE\-bench affords multi\-scale comparison at the level of tasks, items, and error distributions\. We found that current VLMs show moderate task\-level alignment that nonetheless falls short of human–human reliability; modest item\-level alignment in larger models; and heterogeneous trial\-level alignment that depended on both task and model size\. Thus, current models show only partial alignment with human cognitive development\.
Some differences between models and children that we observed plausibly stem from the well\-documented weaknesses of current VLMs in spatial cognition \(e\.g\.,\[[42](https://arxiv.org/html/2606.05497#bib.bib301),[27](https://arxiv.org/html/2606.05497#bib.bib297)\]\)\. For example, the failures of models in mental rotation tasks are not intrinsic to all vision models, as special\-purpose networks can be highly successful\[[29](https://arxiv.org/html/2606.05497#bib.bib6)\]\. Similarly, while multiplication and fraction trials in the math task were hard for humans, counting visual arrays was the most challenging for models\[[33](https://arxiv.org/html/2606.05497#bib.bib1)\]\(see Appendix[B\.2](https://arxiv.org/html/2606.05497#A2.SS2)\)\. Thus, developing better cognitive models will likely require vision backbones with stronger cognitive abilities\[[1](https://arxiv.org/html/2606.05497#bib.bib289),[17](https://arxiv.org/html/2606.05497#bib.bib7)\]\.
### 7\.1Limitations and future directions
Because of the diversity of tasks, wide range of ages, and cross\-linguistic diversity, LEVANTE\-bench constitutes a substantial advance over previous developmental benchmarks\. Nevertheless, it still has substantial limitations\. The current dataset only includes data from six tasks and three languages\. Work in progress seeks to increase the set of LEVANTE languages and to include tasks that span other key cognitive domains\. Further, included tasks have a relatively small number of items \(ranging from dozens in theory of mind to several hundred in math\); this item set should be expanded to increase benchmark precision\. Because LEVANTE is an open\-source project, items are available on GitHub; despite its non\-commercial license, it may still be included in the training data of some recent models, leading to possible contamination\. Because our results do not center on absolute performance comparisons, we do not view this as a critical issue\.
Scaling LEVANTE\-bench can be challenging as analyses of trial\-level alignment require repeated sampling across models\. While we generated at least 10 responses for all models, more precise distributional estimates require substantial computational resources relative to conventional single\-pass benchmarks\. In addition, we made a number of procedural decisions in preparing benchmark materials \(including prompt content, image size, and output parsing\)\. We attempted to make choices that limited worst\-case behavior by smaller models \(minimizing “task demands” per \[[20](https://arxiv.org/html/2606.05497#bib.bib8)\]\), but it is possible that specific optimizations could improve individual model families’ performance\.
Finally, our eventual goal is to use our benchmark to help answer questions about human learning\. For example, human alignment scores could be used to measure how VLMs’ representations change with training or to compare VLMs with cognitively\-relevant differences in architecture or training data\. Achieving this goal depends on access to models that can be analyzed in this way, however\. We did not have access to generative VLMs trained on data from children \(cf\. \[[49](https://arxiv.org/html/2606.05497#bib.bib184), [51](https://arxiv.org/html/2606.05497#bib.bib291)\]\), limiting our ability to make direct comparisons between humans and models trained on human data\. Further, to our knowledge, no prominent VLM families have released checkpoint data, which would allow us to analyze alignment across training within a single model \(as was done in \[[45](https://arxiv.org/html/2606.05497#bib.bib256)\]\)\.
### 7\.2 Conclusions
Language models promise to provide an important tool for understanding human learning and cognition \[[14](https://arxiv.org/html/2606.05497#bib.bib14), [8](https://arxiv.org/html/2606.05497#bib.bib4), [30](https://arxiv.org/html/2606.05497#bib.bib5)\]\. Fulfilling this promise, however, will require models that learn from human\-scale input \[[16](https://arxiv.org/html/2606.05497#bib.bib192), [49](https://arxiv.org/html/2606.05497#bib.bib184), [52](https://arxiv.org/html/2606.05497#bib.bib235)\] and that provide a good match to human learning trajectories from early childhood to adulthood\. Benchmarks play a critical role in this process by furnishing the field with measurement instruments to better assess progress in model development\. In particular, LEVANTE\-bench allows for model–human comparisons at multiple scales – cross\-task, within\-task, and trial\-level error alignments – using tasks that have been psychometrically validated in humans and have parallel versions in multiple languages\. We hope that the current work will thus provide an improved measure with which to assess some of our progress towards models of the human mind\.
## Acknowledgments and Disclosure of Funding
This work, including development and data collection for the LEVANTE core tasks, was supported by the Jacobs Foundation\. Gemini credits were provided through a gift from Google, Inc\. Some of the computing for this project was performed on the Marlowe cluster\. We would like to thank Stanford University and Stanford Research Computing for providing computational resources and support that contributed to these research results, and Mika Braginsky for help with data processing\.
#### Author contributions\.
*Conceptualization*: AWMT, DC, TLB, LBS, MCF; *Data curation*: AWMT, MCF; *Formal analysis*: AWMT, TLB, SY; *Funding acquisition*: MCF; *Investigation*: AWMT, DC, TLB, LBS, MCF; *Methodology*: AWMT, DC, TLB, LBS, MCF; *Resources*: DC, MCF; *Software*: AWMT, DC, TLB, LBS, SY, MCF; *Supervision*: MCF; *Visualization*: AWMT, TLB, SY; *Writing – original draft*: AWMT, MCF; *Writing – review & editing*: AWMT, DC, TLB, LBS, SY, MCF\.
#### Artificial intelligence disclosure\.
*Artificial intelligence tools*: ChatGPT, Claude Code, Claude Cowork, Cursor, Gemini 2\.5 Pro, and Perplexity; *Data collection methods*: Gemini 2\.5 Pro was used to machine translate the task prompts into German and Spanish; *Execution*: ChatGPT, Claude Code, and Cursor were used to assist in code generation for running experiments; *Writing – original draft*: Cursor was used to help with writing Appendix [A](https://arxiv.org/html/2606.05497#A1); *Writing – review & editing*: Claude Cowork and Perplexity were used to review the manuscript for coherence and clarity\.
## References
- \[1\] K\. L\. Aw, K\. Kotar, W\. Lee, S\. Kim, K\. Jedoui, R\. Venkatesh, L\. N\. Chen, M\. C\. Frank, and D\. L\. Yamins \(2026\) Zero\-shot world models are developmentally efficient learners\. External Links: 2604\.10333, [Link](https://arxiv.org/abs/2604.10333) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1), [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[2\] \(2025\) Fast and robust visual object recognition in young children\. Science Advances 11\(27\), pp\. eads6821\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[3\] M\. Binz and E\. Schulz \(2023\) Using cognitive psychology to understand GPT\-3\. Proceedings of the National Academy of Sciences 120\(6\), pp\. e2218523120\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1), [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[4\] D\. E\. Blasi, J\. Henrich, E\. Adamou, D\. Kemmerer, and A\. Majid \(2022\) Over\-reliance on English hinders cognitive science\. Trends in Cognitive Sciences 26\(12\), pp\. 1153–1170\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p4.1)\.
- \[5\] R\. D\. Bock and M\. F\. Zimowski \(1997\) Multiple group IRT\. In Handbook of modern item response theory, pp\. 433–448\. Cited by: [footnote 4](https://arxiv.org/html/2606.05497#footnote4)\.
- \[6\] X\. Cao, Y\. Shen, B\. Lai, W\. Ye, Y\. Ma, J\. Heintz, J\. Chen, M\. Huang, J\. Cao, A\. Zhang, and J\. M\. Rehg \(2025\) What is the visual cognition gap between humans and multimodal LLMs? External Links: 2406\.10424, [Link](https://arxiv.org/abs/2406.10424) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1)\.
- \[7\] L\. Chen, W\. Xie, Y\. Liang, H\. He, H\. Zhao, Z\. Yang, Z\. Huang, H\. Wu, H\. Lu, Y\. charles, Y\. Bao, Y\. Fan, G\. Li, H\. Shen, X\. Chen, W\. Xu, S\. Si, Z\. Cai, W\. Chai, Z\. Huang, F\. Liu, T\. Liu, B\. Chang, X\. Hu, K\. Chen, Y\. Ren, Y\. Liu, Y\. Gong, and K\. Li \(2026\) BabyVision: visual reasoning beyond language\. External Links: 2601\.06521, [Link](https://arxiv.org/abs/2601.06521) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1), [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[8\] L\. Connell and D\. Lynott \(2024\) What can language models tell us about human cognition? Current Directions in Psychological Science 33\(3\), pp\. 181–189\. Cited by: [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[9\] R\. Dominguez\-Olmedo, M\. Hardt, and C\. Mendler\-Dünner \(2024\) Questioning the survey responses of large language models\. Advances in Neural Information Processing Systems 37, pp\. 45850–45878\. Cited by: [§5](https://arxiv.org/html/2606.05497#S5.SS0.SSS0.Px2.p1.1)\.
- \[10\] J\. L\. Elman \(1996\) Rethinking innateness: a connectionist perspective on development\. Vol\. 10, MIT Press\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[11\] S\. E\. Embretson and S\. P\. Reise \(2013\) Item response theory for psychologists\. Psychology Press\. Cited by: [§4](https://arxiv.org/html/2606.05497#S4.p1.5)\.
- \[12\] A\. Fernald, R\. Zangl, A\. L\. Portillo, and V\. A\. Marchman \(2008\) Looking while listening\. Language acquisition and language disorders, pp\. 97–135\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[13\] M\. C\. Frank, H\. A\. Baumgartner, M\. Braginsky, G\. Kachergis, A\. A\. Lightbody, R\. Z\. Sparks, R\. Zhu, S\. M\. Carlson, S\. Graham, S\. J\. Lipina, et al\. \(2025\) Learning Variability Network Exchange \(LEVANTE\): A global framework for measuring children’s learning variability through collaborative data sharing\. Child Development 96\(6\), pp\. 1867–1884\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p5.1), [§3](https://arxiv.org/html/2606.05497#S3.p1.1), [footnote 7](https://arxiv.org/html/2606.05497#footnote7)\.
- \[14\] M\. C\. Frank and N\. D\. Goodman \(2025\) Cognitive modeling using artificial intelligence\. Annual Review of Psychology 77\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1), [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[15\] M\. C\. Frank \(2023\) Baby steps in evaluating the capacities of large language models\. Nature Reviews Psychology 2\(8\), pp\. 451–452\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[16\] M\. C\. Frank \(2023\) Bridging the data gap between children and large language models\. Trends in Cognitive Sciences 27\(11\), pp\. 990–992\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1), [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[17\] Q\. Garrido, N\. Ballas, M\. Assran, A\. Bardes, L\. Najman, M\. Rabbat, E\. Dupoux, and Y\. LeCun \(2025\) Intuitive physics understanding emerges from self\-supervised pretraining on natural videos\. arXiv preprint arXiv:2502\.11831\. Cited by: [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[18\] R\. Gershon, M\. A\. Novack, and A\. J\. Kaat \(2024\) The NIH Infant and Toddler Toolbox: a new standardized tool for assessing neurodevelopment in children ages 1–42 months\. Child Development 95\(6\), pp\. 2252–2254\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[19\] L\. Hewitt, A\. Ashokkumar, I\. Ghezae, and R\. Willer \(2024\) Predicting results of social science experiments using large language models\. Preprint\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[20\] J\. Hu and M\. C\. Frank \(2024\) Auxiliary task demands mask the capabilities of smaller language models\. arXiv preprint arXiv:2404\.02418\. Cited by: [§7\.1](https://arxiv.org/html/2606.05497#S7.SS1.p2.1)\.
- \[21\] J\. Hu, F\. Sosa, and T\. Ullman \(2025\) Re\-evaluating theory of mind evaluation in large language models\. Philosophical Transactions of the Royal Society B: Biological Sciences 380\(1932\)\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[22\] L\. S\. Huber, R\. Geirhos, and F\. A\. Wichmann \(2023\) The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks\. Journal of Vision 23\(7\), pp\. 4–4\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[23\] G\. Jiang, M\. Xu, S\. Xin, W\. Liang, Y\. Peng, C\. Zhang, and Y\. Zhu \(23–29 Jul 2023\) MEWL: few\-shot multimodal word learning with referential uncertainty\. In Proceedings of the 40th International Conference on Machine Learning, A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 202, pp\. 15144–15169\. External Links: [Link](https://proceedings.mlr.press/v202/jiang23i.html) Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[24\] G\. Kachergis, F\. O’Reilly, M\. Braginsky, X\. Xiao, A\. Lightbody, K\. Shannon, Z\. Watson, L\. Zhang, R\. Zhu, A\. Abutto, et al\. \(2025\) Creation and validation of the LEVANTE core tasks: Internationalized measures of learning and development for children ages 5\-12 years\. External Links: [Link](https://doi.org/10.31234/osf.io/r4dhw_v1) Cited by: [Figure 1](https://arxiv.org/html/2606.05497#S1.F1), [§1](https://arxiv.org/html/2606.05497#S1.p5.1), [§3](https://arxiv.org/html/2606.05497#S3.p1.1), [§3](https://arxiv.org/html/2606.05497#S3.p3.3), [footnote 4](https://arxiv.org/html/2606.05497#footnote4)\.
- \[25\] M\. Kosinski \(2024\) Evaluating large language models in theory of mind tasks\. Proceedings of the National Academy of Sciences 121\(45\), pp\. e2405460121\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[26\] B\. Li \(2026\) Incompressible knowledge probes: estimating black\-box LLM parameter counts via factual capacity\. External Links: 2604\.24827, [Link](https://arxiv.org/abs/2604.24827) Cited by: [footnote 5](https://arxiv.org/html/2606.05497#footnote5)\.
- \[27\] Y\. Li, Q\. Gao, T\. Zhao, B\. Wang, H\. Sun, H\. Lyu, R\. D\. Hawkins, N\. Vasconcelos, T\. Golan, D\. Luo, and H\. Deng \(2025\) Core knowledge deficits in multi\-modal language models\. External Links: 2410\.10855, [Link](https://arxiv.org/abs/2410.10855) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1), [§2](https://arxiv.org/html/2606.05497#S2.p2.1), [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[28\] H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. Advances in Neural Information Processing Systems 36, pp\. 34892–34916\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[29\] S\. R\. Mason, A\. Gjølbye, P\. C\. Højbjerg, L\. Tětková, and L\. K\. Hansen \(2026\) Large vision models can solve mental rotation problems\. External Links: 2509\.15271, [Link](https://arxiv.org/abs/2509.15271) Cited by: [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[30\] S\. W\. McGrath, J\. Russin, E\. Pavlick, and R\. Feiman \(2024\) How can deep neural networks inform theory in psychological science? Current Directions in Psychological Science 33\(5\), pp\. 325–333\. Cited by: [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[31\] I\. R\. McKenzie, A\. Lyzhov, M\. Pieler, A\. Parrish, A\. Mueller, A\. Prabhu, E\. McLean, A\. Kirtland, A\. Ross, A\. Liu, et al\. \(2023\) Inverse scaling: when bigger isn’t better\. arXiv preprint arXiv:2306\.09479\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[32\] E\. Pavlick \(2023\) Symbols and grounding in large language models\. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 381\(2251\)\. External Links: ISSN 1471\-2962, [Link](http://dx.doi.org/10.1098/rsta.2022.0041), [Document](https://dx.doi.org/10.1098/rsta.2022.0041) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[33\] M\. F\. Qharabagh, M\. Ghofrani, and K\. Fountoulakis \(2026\) LVLM\-count: enhancing the counting ability of large vision\-language models\. External Links: 2412\.00686, [Link](https://arxiv.org/abs/2412.00686) Cited by: [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[34\] A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, et al\. \(2021\) Learning transferable visual models from natural language supervision\. In International Conference on Machine Learning, pp\. 8748–8763\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[35\] Y\. Ruan, C\. J\. Maddison, and T\. Hashimoto \(2024\-10\) Observational scaling laws and the predictability of language model performance\. arXiv\. External Links: 2405\.10938, [Document](https://dx.doi.org/10.48550/arXiv.2405.10938) Cited by: [§6\.1](https://arxiv.org/html/2606.05497#S6.SS1.p1.1)\.
- \[36\] D\. E\. Rumelhart, J\. L\. McClelland, P\. R\. Group, et al\. \(1986\) Parallel distributed processing, volume 1: explorations in the microstructure of cognition: foundations\. The MIT Press\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[37\] L\. M\. Schulze Buschoff, E\. Akata, M\. Bethge, and E\. Schulz \(2025\) Visual cognition in multimodal large language models\. Nature Machine Intelligence 7\(1\), pp\. 96–106\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1), [§1](https://arxiv.org/html/2606.05497#S1.p3.1)\.
- \[38\] R\. S\. Shah, K\. Bhardwaj, and S\. Varma \(2024\) Development of cognitive intelligence in pre\-trained language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 9632–9657\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[39\] S\. Sheybani, L\. Smith, Z\. Tiganj, S\. Maini, and A\. Dendukuri \(2024\) ModelVsBaby: a developmentally motivated benchmark of out\-of\-distribution object recognition\. Preprint at https://osf\.io/preprints/psyarxiv/83gae\_v1\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[40\] H\. A\. Simon \(1980\) Cognitive science: the newest science of the artificial\. Cognitive Science 4\(1\), pp\. 33–46\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[41\] L\. Singh, M\. Casillas, S\. Allen, M\. Frank, and C\. Rowland \(2026\) Innateness is \(still\) an orienting principle for language development\. PsyArXiv\. External Links: [Link](https://osf.io/preprints/psyarxiv/ykz8j_v1) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[42\] I\. Stogiannidis, S\. McDonagh, and S\. A\. Tsaftaris \(2025\) Mind the gap: benchmarking spatial reasoning in vision\-language models\. External Links: 2503\.19707, [Link](https://arxiv.org/abs/2503.19707) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1), [§6\.1](https://arxiv.org/html/2606.05497#S6.SS1.p1.1), [§7](https://arxiv.org/html/2606.05497#S7.p2.1)\.
- \[43\] B\. Sturgeon and L\. Chan \(2026\-04\) Sanity\-checking “incompressible knowledge probes”\. External Links: [Link](https://www.lesswrong.com/posts/veFMEzDDyWaer2Sms/sanity-checking-incompressible-knowledge-probes) Cited by: [footnote 5](https://arxiv.org/html/2606.05497#footnote5)\.
- \[44\] A\. W\. M\. Tan, J\. Yang, T\. Sepuri, K\. L\. Aw, R\. Z\. Sparks, Z\. Yin, V\. A\. Marchman, M\. C\. Frank, and B\. Long \(2025\) Assessing the alignment between infants’ visual and linguistic experience using multimodal language models\. arXiv preprint arXiv:2511\.18824\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1)\.
- \[45\] A\. W\. M\. Tan, S\. Yu, B\. Long, W\. A\. Ma, T\. Murray, R\. D\. Silverman, J\. D\. Yeatman, and M\. C\. Frank \(2025\-01\) DevBench: A multimodal developmental benchmark for language learning\. In Advances in Neural Information Processing Systems, Vol\. 37, Vancouver, BC, pp\. 77445–77467\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p4.1), [§2](https://arxiv.org/html/2606.05497#S2.p2.1), [§4](https://arxiv.org/html/2606.05497#S4.SS0.SSS0.Px3.p1.5), [§6\.4](https://arxiv.org/html/2606.05497#S6.SS4.p2.4), [§7\.1](https://arxiv.org/html/2606.05497#S7.SS1.p3.1)\.
- \[46\] S\. Truong, N\. Goodman, E\. Brunskill, B\. Domingue, N\. Haber, and S\. Koyejo \(2026\) A measurement science roadmap: from human assessment to AI evaluation\. Cited by: [§4](https://arxiv.org/html/2606.05497#S4.p1.5)\.
- \[47\] S\. Truong, Y\. Tu, M\. Hardy, A\. Reuel, Z\. Tang, J\. Burapacheep, J\. Perera, C\. Uwakwe, B\. Domingue, N\. Haber, and S\. Koyejo \(2025\) Fantastic bugs and where to find them in AI benchmarks\. External Links: 2511\.16842, [Link](https://arxiv.org/abs/2511.16842) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p4.1)\.
- \[48\] W\. K\. Vong and B\. M\. Lake \(2025\) On the robustness of modeling grounded word learning through a child’s egocentric input\. arXiv preprint arXiv:2507\.14749\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1)\.
- \[49\] W\. K\. Vong, W\. Wang, A\. E\. Orhan, and B\. M\. Lake \(2024\) Grounded language acquisition through the eyes and ears of a single child\. Science 383\(6682\), pp\. 504–511\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1), [§7\.1](https://arxiv.org/html/2606.05497#S7.SS1.p3.1), [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[50\] S\. Wang, A\. Chandra, A\. Liu, V\. Saligrama, and B\. Gong \(2025\) BabyVLM: data\-efficient pretraining of VLMs inspired by infant learning\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 1380–1390\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1), [§1](https://arxiv.org/html/2606.05497#S1.p3.1), [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[51\] S\. Wang, W\. Wang, Z\. Wang, M\. Whitton, M\. Wakeham, A\. Chandra, J\. Huang, P\. Zhu, H\. Chen, D\. Li, et al\. \(2025\) BabyVLM\-V2: toward developmentally grounded pretraining and benchmarking of vision foundation models\. arXiv preprint arXiv:2512\.10932\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p2.1), [§7\.1](https://arxiv.org/html/2606.05497#S7.SS1.p3.1)\.
- \[52\] A\. Warstadt, L\. Choshen, A\. Mueller, A\. Williams, E\. Wilcox, and C\. Zhuang \(2023\) Call for papers – The BabyLM challenge: sample\-efficient pretraining on a developmentally plausible corpus\. arXiv preprint arXiv:2301\.11796\. Cited by: [§7\.2](https://arxiv.org/html/2606.05497#S7.SS2.p1.1)\.
- \[53\] A\. Warstadt, A\. Mueller, L\. Choshen, E\. Wilcox, C\. Zhuang, J\. Ciro, R\. Mosquera, B\. Paranjabe, A\. Williams, T\. Linzen, et al\. \(2023\) Findings of the BabyLM challenge: sample\-efficient pretraining on developmentally plausible corpora\. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1)\.
- \[54\] T\. Webb, K\. J\. Holyoak, and H\. Lu \(2023\) Emergent analogical reasoning in large language models\. Nature Human Behaviour 7\(9\), pp\. 1526–1541\. Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p1.1), [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[55\] L\. Weihs, A\. Yuile, R\. Baillargeon, C\. Fisher, G\. Marcus, R\. Mottaghi, and A\. Kembhavi \(2022\) Benchmarking progress to infant\-level physical reasoning in AI\. Transactions on Machine Learning Research\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
- \[56\] H\. M\. Wellman, D\. Cross, and J\. Watson \(2001\) Meta\-analysis of theory\-of\-mind development: the truth about false belief\. Child Development 72\(3\), pp\. 655–684\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[57\] W\. Xie, Z\. Wang, S\. Ma, X\. Sun, K\. Chen, E\. Wang, W\. Liu, and H\. Tong \(2026\) AIPsychoBench: understanding the psychometric differences between LLMs and humans\. Topics in Cognitive Science 18\(2\), pp\. e70041\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[58\] Y\. Xie, Q\. Mei, W\. Yuan, and M\. O\. Jackson \(2025\) Using large language models to categorize strategic situations and decipher motivations behind human behaviors\. Proceedings of the National Academy of Sciences 122\(35\), pp\. e2512075122\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[59\] E\. Yiu, E\. Kosoy, and A\. Gopnik \(2024\) Transmission versus truth, imitation versus innovation: what children can do that large language and language\-and\-vision models cannot \(yet\)\. Perspectives on Psychological Science 19\(5\), pp\. 874–883\. Cited by: [§2](https://arxiv.org/html/2606.05497#S2.p1.1)\.
- \[60\] E\. Yiu, M\. Qraitem, A\. N\. Majhi, C\. Wong, Y\. Bai, S\. Ginosar, A\. Gopnik, and K\. Saenko \(2025\) KiVA: kid\-inspired visual analogies for testing large multimodal models\. External Links: 2407\.17773, [Link](https://arxiv.org/abs/2407.17773) Cited by: [§1](https://arxiv.org/html/2606.05497#S1.p3.1), [§2](https://arxiv.org/html/2606.05497#S2.p2.1)\.
## Appendix A Prompt sensitivity analysis
We conducted a systematic exploration of prompt design across all six tasks and multiple model families before selecting the default prompt for the main experiments\. The goal was to verify that our results reflect model capabilities rather than artifacts of a particular prompt phrasing\. All notebooks and raw output are available in our supplementary materials\.
#### Methodology\.
For each task we defined a set of prompt *phases* that could be applied independently or in combination:

- Phase 0 \(baseline\): the task stimulus followed by a JSON output instruction;
- Phase 1 \(structured layout\): reformatted multiline text with labeled option blocks;
- Phase 2 \(enhanced parsing\): parser\-side improvements only \(no prompt change\);
- Phase 3 \(task system prompt\): a task\-specific system message \(e\.g\., “You are a visual vocabulary expert”\);
- Phase 4 \(task\-specific hints\): domain knowledge such as mirror/chirality hints for mental rotation or distractor\-awareness cues for vocabulary;
- Phase 5 \(chain\-of\-thought\): step\-by\-step reasoning instructions, sometimes with increased token budgets \(512–1024 tokens\)\.

All phases were tested individually and in combinations up to the full stack\.
#### Cross\-model sweep\.
A dedicated robustness sweep tested 5 models \(Qwen 0\.8B, InternVL 1B, SmolVLM2 2\.2B, InternVL 4B, and Gemma3 4B\) across all 6 tasks using 4 task\-framing variants × 2 output\-format variants \(bare letter vs\. JSON\), for a total of 240 model–task–prompt cells\. The overall mean accuracy was 46\.9% \(bare\) vs\. 45\.7% \(JSON\), and mean parse rates were 98\.1% vs\. 96\.8%, indicating that the output format choice has a small effect\. A maximin analysis, which selects the prompt that maximizes the worst\-case accuracy across models, favored the *minimal framing with JSON output* \(TF1 × OF2\) on 4 of 6 tasks\.
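The maximin criterion can be illustrated with a minimal Python sketch\. The function name `maximin_prompt`, the prompt labels, the model labels, and the accuracy values below are hypothetical placeholders rather than sweep results; the sketch only shows the selection rule \(take each prompt’s worst accuracy across models, then pick the prompt whose worst case is highest\)\.

```python
def maximin_prompt(accuracy):
    """Return the prompt with the highest worst-case accuracy across models.

    accuracy -- dict mapping prompt label -> dict mapping model label -> accuracy
    """
    # Worst-case (minimum) accuracy of each prompt over all models.
    worst_case = {prompt: min(by_model.values()) for prompt, by_model in accuracy.items()}
    # Prompt whose worst case is largest; ties resolve to the first prompt listed.
    best_prompt = max(worst_case, key=worst_case.get)
    return best_prompt, worst_case


# Hypothetical accuracies for one task (illustrative values only).
toy_accuracy = {
    "TF1xOF2": {"small_model": 0.40, "large_model": 0.50},
    "TF3xOF1": {"small_model": 0.20, "large_model": 0.70},
}
best, worst = maximin_prompt(toy_accuracy)
print(best, worst[best])
```

In this toy case the second prompt has the higher peak accuracy, but the first is selected because it degrades less for the weaker model, which is the property a maximin rule is designed to reward\.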
#### Per\-task findings\.
Table [1](https://arxiv.org/html/2606.05497#A1.T1) summarizes the key results\.

Table 1: Prompt sensitivity by task: accuracy range observed across prompt strategies for representative models\. $\Delta_{\max}$ is the spread between the best and worst prompt configuration\.

Several patterns emerged consistently:
1. CoT hurts small models\. For sub\-2B models, chain\-of\-thought instructions reduced both accuracy and parse rate on vocabulary \(−23\.5 pp\), sentence understanding \(−10\.1 pp\), and matrix reasoning \(−7\.6 pp\)\. Extended reasoning consumed the token budget without producing a parseable answer\.
2. Prompt gains are model\-specific\. The expert system prompt that boosted InternVL 8B on matrix reasoning by \+12\.7 pp *decreased* InternVL 2B accuracy by −20\.3 pp on the same task\. Similarly, the describe\-first strategy that helped Qwen 2B on sentence understanding \(\+13\.1 pp\) was not effective for the 0\.8B variant\.
3. Mental rotation resists prompting\. Across Qwen 0\.8B, InternVL 2B, InternVL 8B, and three spatial fine\-tuned models \(SpaceThinker, SpaceOm, SpatialThinker\), no prompt strategy reliably exceeded chance after controlling for position bias via answer\-permutation debiasing\. The apparent best result \(62\.7% via self\-consistency on Qwen2\.5\-VL\-3B\) was not significant under a bias\-aware null model \($p \approx 0.183$\)\.
4. Spatial fine\-tuning does not help\. Three models fine\-tuned for spatial reasoning \(SpaceThinker, SpaceOm, SpatialThinker\-Oxford\) all scored 59\.0% with the baseline elimination prompt \(identical to the position\-biased ceiling\) and dropped to 38\.6–48\.2% with their recommended paper prompts, which elicit longer reasoning chains\.
5. Stacking phases has diminishing or negative returns\. Full\-stack combinations often underperformed the best single phase\. In sentence understanding \(0\.8B\), combining all phases yielded 35\.4% vs\. 43\.4% for the structural stack alone\. In vocabulary, the full stack was the exception that improved over individual phases \(\+12\.4 pp\), driven by synergistic interactions between structure and distractor awareness\.
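One way to construct a bias\-aware null of the kind mentioned in item 3 is a Monte Carlo simulation of a responder that ignores item content and chooses answer positions at the model’s own observed position frequencies\. The sketch below is an assumption about how such a test could be set up, not the analysis code behind the reported p\-value; the function name, the add\-one correction, and the toy inputs are all hypothetical\.

```python
import random
from collections import Counter


def position_bias_null_pvalue(chosen, correct, n_sim=2000, seed=0):
    """Monte Carlo p-value for accuracy under a content-blind, position-biased null.

    chosen  -- answer position picked by the model on each trial
    correct -- position of the correct answer on each trial
    """
    if len(chosen) != len(correct) or not chosen:
        raise ValueError("chosen and correct must be non-empty and equal in length")

    rng = random.Random(seed)
    n_trials = len(chosen)
    observed_hits = sum(c == k for c, k in zip(chosen, correct))

    # Null responder: samples positions at the model's own marginal rates,
    # independently of which position holds the correct answer.
    counts = Counter(chosen)
    positions = sorted(counts)
    weights = [counts[p] for p in positions]

    at_least_as_good = 0
    for _ in range(n_sim):
        simulated = rng.choices(positions, weights=weights, k=n_trials)
        hits = sum(s == k for s, k in zip(simulated, correct))
        if hits >= observed_hits:
            at_least_as_good += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_good + 1) / (n_sim + 1)
    return observed_hits / n_trials, p_value


# Toy example: always choosing position 0 scores 60% when 60% of the correct
# answers sit at position 0, yet this is exactly what the null predicts.
always_first = [0] * 10
answer_key = [0] * 6 + [1] * 4
accuracy, p = position_bias_null_pvalue(always_first, answer_key)
```

Under a null of this kind, above\-chance raw accuracy is not evidence of task competence whenever the answer key is itself unbalanced across positions, which is consistent with the position\-biased ceiling described in item 4\.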
#### Justification for the default prompt\.
Given these results, we opted for the minimal JSON\-based prompt as the default for main experiments\. This configuration \(i\) achieves competitive accuracy across all model families in maximin analysis, \(ii\) maintains parse rates above 95% for all models except the smallest, and \(iii\) enables a fair, model\-agnostic comparison without introducing prompt\-induced variance that could confound cross\-model and cross\-task analyses\.
## Appendix B Additional results
### B\.1 Full KL divergence results
Figure 7: $D_{\mathrm{KL}}$ between model and human response distributions plotted by $\log_{10}$ number of parameters for all tasks\. Lower values indicate greater model–human alignment\.

Figure 8: $D_{\mathrm{KL}}$ between model and human response distributions plotted by human ability bins for all tasks\. Lower values indicate greater model–human alignment\.

Figures [7](https://arxiv.org/html/2606.05497#A2.F7) and [8](https://arxiv.org/html/2606.05497#A2.F8) show the full $D_{\mathrm{KL}}$ distributions for all models across all ability bins for all tasks\. Math, sentence understanding, and theory of mind show similar trends, with larger models being more similar to higher\-ability humans and vice versa\. Vocabulary and matrix reasoning show $D_{\mathrm{KL}}$ distributions that depend more on the specific model, while mental rotation shows collapse across models with very limited differentiation\.
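For readers who want to reproduce a comparison of this kind, the sketch below computes a KL divergence between two response distributions over the answer options of a single item\. The direction of the divergence \(human relative to model\), the additive smoothing constant, and all names and toy responses are assumptions made for illustration; this appendix does not specify them\.

```python
import math
from collections import Counter


def response_distribution(responses, options, alpha=0.5):
    """Smoothed proportion of responses falling on each option."""
    counts = Counter(responses)
    total = len(responses) + alpha * len(options)
    return [(counts[o] + alpha) / total for o in options]


def kl_divergence(p, q):
    """D_KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


options = ["A", "B", "C", "D"]
# Hypothetical responses: 20 children in one ability bin, 10 samples from one model.
human = response_distribution(["A"] * 14 + ["B"] * 4 + ["C"] * 2, options)
model = response_distribution(["A"] * 6 + ["B"] * 1 + ["D"] * 3, options)
d_kl = kl_divergence(human, model)
```

Smoothing matters here because a model sampled only 10 times will often assign zero observed mass to an option that some children chose, which would otherwise make the divergence infinite\.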
### B\.2 Item subtype analysis
Figure 9: Accuracy deviation from overall task accuracy\. Black dots and error bars indicate all\-model means and bootstrapped 95% confidence intervals\. Dotted line indicates zero\. Human accuracy estimates are not yet available for some item subtypes\. MR: mental rotation; ToM: theory of mind\.

To understand the distribution of model accuracies across items, we further examined model performance for tasks with item subtypes\. We calculated each model’s accuracy for each item subtype \(for tasks with subtypes\), then calculated the difference between subtype accuracy and overall task accuracy\.
This analysis revealed areas of divergence between models and humans\. For the math task, humans were comparatively better at addition and missing\-number items relative to models, whereas they were comparatively much worse at multiplication and fraction questions\. For mental rotation, humans found 2D shapes easier than 3D shapes, whereas the reverse was true for models\. For theory of mind, humans showed a relative ordering between reality check, emotion reasoning, and false belief questions, whereas models were broadly similar on these three subtypes\. Interestingly, sentence understanding showed relatively similar deviations in subtype accuracy between models and humans\. These disparities suggest that items may function differentially between models and humans, especially in reasoning domains\.
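The deviation measure used in this analysis can be sketched as follows\. The trial format \(a subtype label paired with a correctness flag\), the function name, and the toy data are assumptions; the sketch pools trials to obtain overall accuracy, which may differ from the aggregation used for Figure 9\.

```python
from collections import defaultdict


def subtype_deviation(trials):
    """Subtype accuracy minus overall task accuracy for one model on one task.

    trials -- iterable of (subtype_label, is_correct) pairs
    """
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")

    # Overall accuracy pooled over all trials of the task.
    overall = sum(bool(ok) for _, ok in trials) / len(trials)

    by_subtype = defaultdict(list)
    for subtype, ok in trials:
        by_subtype[subtype].append(bool(ok))

    # Positive values mean the subtype is easier than the task average.
    return {s: sum(v) / len(v) - overall for s, v in by_subtype.items()}


# Hypothetical math-task trials (illustrative only).
toy_trials = [
    ("addition", True),
    ("addition", True),
    ("fraction", False),
    ("fraction", True),
]
deviations = subtype_deviation(toy_trials)
```

Because each deviation is expressed relative to the same task\-level baseline, the values can be compared between models and humans even when their absolute accuracies differ\.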