Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation

arXiv cs.AI Papers

Summary

This paper dissociates difficulty registration from deliberation allocation in large reasoning models (LRMs) and humans, finding that LRMs spend more tokens on problems they get wrong while humans spend less time on failures, revealing opposite within-item patterns despite similar cross-item difficulty correlations.

arXiv:2606.26502v1 Announce Type: new Abstract: Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen's d = 1.47-3.13 on H-ARC) while humans show the opposite sign. The comparison stays inside each agent's own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. We read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. We read the LRM pattern as length driven by uncertainty: chains grow when the model is unsure, which is exactly when it tends to fail. Both policies produce the same cross-item correlation with difficulty, so they look aligned on the measure prior work has used; the divergence shows up only once item identity is fixed. Under resource-rational metareasoning, the split is between two stopping policies that share a difficulty signal but implement opposite control; trace length captures the signal and misses the control.

# Separating Difficulty Registration from Deliberation Allocation
Source: [https://arxiv.org/html/2606.26502](https://arxiv.org/html/2606.26502)

###### Abstract

Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human–LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen's $d = 1.47$–$3.13$ on H-ARC) while humans show the opposite sign. The comparison stays inside each agent's own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. We read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. We read the LRM pattern as length driven by uncertainty: chains grow when the model is unsure, which is exactly when it tends to fail. Both policies produce the same cross-item correlation with difficulty, so they look aligned on the measure prior work has used; the divergence shows up only once item identity is fixed. Under resource-rational metareasoning, the split is between two stopping policies that share a difficulty signal but implement opposite control; trace length captures the signal and misses the control.

Keywords: large reasoning models, deliberation, reaction time, chain-of-thought, resource-rational analysis

## 1 Introduction

A central question in the cognitive science of deliberation is how thinkers decide when an item is worth more thought, and when it is not. Resource-rational analysis casts the decision as a comparison of expected gain to opportunity cost (Simon, [1956](https://arxiv.org/html/2606.26502#bib.bib29); Russell and Wefald, [1991](https://arxiv.org/html/2606.26502#bib.bib2); Lieder and Griffiths, [2020](https://arxiv.org/html/2606.26502#bib.bib1); Gershman et al., [2015](https://arxiv.org/html/2606.26502#bib.bib30); Callaway et al., [2022](https://arxiv.org/html/2606.26502#bib.bib33)). Metacognitive theories locate it in confidence-based monitoring and disengagement (Nelson and Narens, [1990](https://arxiv.org/html/2606.26502#bib.bib27); Yeung and Summerfield, [2012](https://arxiv.org/html/2606.26502#bib.bib28); Nelson and Leonesio, [1988](https://arxiv.org/html/2606.26502#bib.bib35)). Sequential-sampling models locate it at the boundary that terminates evidence accumulation (Ratcliff and McKoon, [2008](https://arxiv.org/html/2606.26502#bib.bib21); Bogacz et al., [2006](https://arxiv.org/html/2606.26502#bib.bib3); Heitz, [2014](https://arxiv.org/html/2606.26502#bib.bib22)). All three traditions converge on a two-step picture. An agent first *registers* that an item is hard: a perceptual or evaluative difficulty signal. The agent then *allocates* computation around that registered difficulty: a stopping or scheduling rule that decides whether to keep thinking on the item in front of it. Difficulty registration and deliberation allocation are dissociable: an agent can grade items by difficulty in the same order as another agent yet schedule its own computation on a different policy.

Large reasoning models (LRMs) make this decomposition empirically tractable. They extend the chain-of-thought paradigm of producing intermediate reasoning before a final answer (Wei et al., [2022](https://arxiv.org/html/2606.26502#bib.bib23); Kojima et al., [2022](https://arxiv.org/html/2606.26502#bib.bib24); Snell et al., [2025](https://arxiv.org/html/2606.26502#bib.bib25)), and the cognitive-psychology-style analysis of language models has begun mapping their behavioural signatures against human ones (Binz and Schulz, [2023](https://arxiv.org/html/2606.26502#bib.bib34)). Like a human reaction time, an LRM's reasoning-trace length is a behavioural readout of how long the agent dwells on a problem; the two measures are analogous, not interchangeable. de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) reported a striking alignment on this measure: across seven reasoning tasks, LRM reasoning-token length tracks human reaction time both within and across paradigms. Items that take humans longer also elicit longer LRM traces. That result is a clean demonstration of difficulty registration. It is silent on allocation: cross-item correlations describe how the two systems grade items relative to one another, not how either system schedules computation around its own successes and failures.

The published commentaries on de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) press exactly this gap. Vankov et al. ([2026](https://arxiv.org/html/2606.26502#bib.bib7)) tested the causal interpretation through reasoning-effort manipulations and found minimal accuracy effects in five of six tasks, arguing that the correlation alone does not license alignment claims. Dujmović ([2026](https://arxiv.org/html/2606.26502#bib.bib8)) emphasised that correlation does not establish mechanistic similarity and called for testable hypotheses about shared mechanisms. Hu ([2026](https://arxiv.org/html/2606.26502#bib.bib9)) raised the alternative that intermediate tokens may function as performative scaffolding rather than incremental internal computation, and proposed inference-time truncation as a discriminating diagnostic. The authors' reply to Dujmović (de Varda et al., [2026a](https://arxiv.org/html/2606.26502#bib.bib6)) explicitly disavows algorithmic-level mechanistic claims, framing the finding as “a robust empirical phenomenon” whose underlying explanation remains open; their reply to Vankov (de Varda et al., [2026b](https://arxiv.org/html/2606.26502#bib.bib5)) defended the alignment claim with additional 33-task evidence and noted that H-ARC, the primary paradigm of the present study, is the one task in which manipulating reasoning effort substantially changes accuracy, but it did not address within-item allocation. The literature has thus converged on three points: cross-item alignment is real, it is not by itself a mechanism claim, and a separate diagnostic is needed to test how deliberation is allocated.

We supply that diagnostic. The outcome-conditioned question is simple: does the agent spend more deliberation on trials it answers correctly than on trials it answers incorrectly? We ask it for each agent on each paradigm, normalising deliberation within each agent's own scale so that seconds and tokens are never compared directly, and we then use item fixed effects to test the human–LRM interaction. Each agent serves as its own baseline, and item fixed effects absorb the cross-item difficulty gradient that the original de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) finding lives on. The resulting slope is a diagnostic of deliberation *allocation* that is orthogonal to, and complements, the cross-item difficulty-registration diagnostic (Figure [1](https://arxiv.org/html/2606.26502#S3.F1)). We use “within-agent” in the behavioral-comparison sense; because the public release contains one observation per (item $\times$ model) pair, the item-fixed LRM slope is identified across models attempting the same item rather than across repeated samples from a single model, so the contrast is item-controlled in the strict sense (a point we return to in Limitations). The two diagnostics together form a minimal two-level test of human–LRM alignment in deliberation.

The scope is bounded\. The LRM sample is six open\-weight thinking models with DeepSeek\-V3 as a non\-thinking baseline; closed\-frontier reasoning systems are not tested\. The human comparison is to the samples reported in the source release rather than humans in general\. Three reasoning paradigms are analysed: H\-ARC \(visual abstraction\) as the primary paradigm, INTUIT as a structurally distinct cross\-paradigm replication, and Cortes \(binary relational reasoning\) as a paradigm\-dependent boundary case\. The Methods, Results, and Discussion that follow develop the diagnostic, report its outcome, and recover from the outcome a candidate reading of what the two systems are doing differently when they think\.

#### Contributions\.

This paper makes three contributions. First, it separates two behavioral levels that are routinely conflated in human–model comparison: cross-item difficulty *registration* (does deliberation length track which items are hard?) and within-item deliberation *allocation* (given items recognised as hard, where does the agent spend more thought?). Second, it introduces an outcome-conditioned, item-controlled diagnostic that asks whether an agent allocates more deliberation to its own successes or failures, using each agent's native deliberation scale and holding item identity exactly fixed. Third, it uses this diagnostic to show that apparent human–LRM alignment at the cross-item level masks an opposite within-item allocation policy. The point is not that seconds and reasoning tokens are equivalent, nor that reasoning traces directly reveal latent computation. Rather, the point is that both are observable behavioural readouts whose outcome-conditioned structure, when read within each agent's own scale, constrains theories of metacognitive control, stopping, and resource allocation in a way that cross-item correlation alone cannot.

## 2 Methods

#### Datasets and models\.

This study reanalyses publicly released, de-identified human and LRM data; no new human-subjects data were collected and no IRB review was required. We use public data from the de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) release and the underlying behavioural corpora (LeGris et al., [2025](https://arxiv.org/html/2606.26502#bib.bib10); Prunty et al., [2025](https://arxiv.org/html/2606.26502#bib.bib11); Cortés et al., [2021](https://arxiv.org/html/2606.26502#bib.bib12)). Table [1](https://arxiv.org/html/2606.26502#S2.T1) summarises the three reasoning paradigms. H-ARC (LeGris et al., [2025](https://arxiv.org/html/2606.26502#bib.bib10); Chollet, [2019](https://arxiv.org/html/2606.26502#bib.bib26)) is the primary paradigm. INTUIT is the clean cross-paradigm replication; Cortes is a boundary case (see Generalisation and the Cortes boundary); arithmetic is excluded because all thinking LRMs are at ceiling. The thinking-LRM sample (DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.26502#bib.bib14)), Qwen-QwQ-32B and Qwen3-235B-Thinking (Qwen Team, [2025](https://arxiv.org/html/2606.26502#bib.bib16)), GLM-4.5-Air-FP8 (Z.ai, [2025](https://arxiv.org/html/2606.26502#bib.bib17)), gpt-oss-20b and gpt-oss-120b (OpenAI, [2025](https://arxiv.org/html/2606.26502#bib.bib15))) is open-weight only. DeepSeek-V3 (DeepSeek-AI, [2024](https://arxiv.org/html/2606.26502#bib.bib13)) is a non-thinking control. Within H-ARC four LRMs are well-powered (DeepSeek-R1, GLM-4.5-Air-FP8, gpt-oss-120b, Qwen-QwQ-32B; all $n_{\text{LRM}} \geq 298$); gpt-oss-20b ($n = 119$, parser-limited) and Qwen3-235B-Thinking ($n = 43$, 4 wrong trials) are reported descriptively. All six thinking LRMs enter on INTUIT and Cortes.

#### Deliberation measures\.

For thinking LRMs, deliberation $t$ is the `reasoning_token_length` field of the released frame: the number of intermediate-trace tokens generated before the final answer, counted in each model's own native tokenizer. For the non-thinking DeepSeek-V3 control we use `total_output_tokens`. We treat $t$ as a behavioural readout of the model's generation policy under nominally-greedy decoding, not as a meter of latent computation: trace tokens are not guaranteed reports of underlying reasoning (Valmeekam et al., [2025](https://arxiv.org/html/2606.26502#bib.bib19); Samineni et al., [2025](https://arxiv.org/html/2606.26502#bib.bib18); Kambhampati et al., [2025](https://arxiv.org/html/2606.26502#bib.bib20)). Each agent is its own within-agent baseline, so seconds and tokens are never compared directly. Following de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4))'s analysis script, we deduplicate the released frame to one (item, model) row keeping the first attempt. INTUIT traces are restricted in the public release, so the analysis in Inside the dissociation is restricted to H-ARC and Cortes. Full provenance, dedup, and per-model parseable/correct/wrong counts are in the SI (S6).

Table 1: Datasets analysed. Note (a): Qwen3-235B-Thinking ($n = 43$ matched H-ARC items) and gpt-oss-20b ($n = 119$, parser-limited) are flagged as low-power on H-ARC and reported separately. Full dataset and scoring detail are in the OSF deposit.

#### The within\-agent d\-ratio\.

For agent $A$, paradigm $P$, and trial $i$ with deliberation $t_i > 0$ and correctness $c_i \in \{0, 1\}$, the d-ratio is the pooled-SD Cohen's $d$ on log deliberation. Here $t_i$ is reaction time in seconds for humans, `reasoning_token_length` for thinking LRMs, and total output length for V3:

$$d_{\mathrm{wrong}-\mathrm{right}}(A, P) = \frac{\bar{\ell}_{\mathrm{wrong}} - \bar{\ell}_{\mathrm{right}}}{s_{\mathrm{pooled}}}, \qquad \ell_i = \log t_i.$$

Positive values mean the agent deliberates longer on its failures.
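As an illustration of the definition above, the following minimal sketch computes the d-ratio for one agent on one paradigm. The function name and the input layout are hypothetical, and the degrees-of-freedom weighting of the pooled variance is the textbook convention, assumed here because the text does not spell out the pooling formula.

```python
import math
from statistics import mean, variance


def d_wrong_minus_right(deliberation, correct):
    """Pooled-SD Cohen's d on log deliberation, wrong minus right.

    deliberation: positive numbers (seconds for humans, tokens for LRMs).
    correct: 0/1 flags aligned with `deliberation`.
    """
    logs_wrong = [math.log(t) for t, c in zip(deliberation, correct) if c == 0]
    logs_right = [math.log(t) for t, c in zip(deliberation, correct) if c == 1]
    n_w, n_r = len(logs_wrong), len(logs_right)
    if n_w < 2 or n_r < 2:
        raise ValueError("need at least two wrong and two right trials")
    # Assumption: pooled variance weights each cell's sample variance
    # by its degrees of freedom.
    pooled_var = (
        (n_w - 1) * variance(logs_wrong) + (n_r - 1) * variance(logs_right)
    ) / (n_w + n_r - 2)
    return (mean(logs_wrong) - mean(logs_right)) / math.sqrt(pooled_var)


# Toy data: wrong trials are four times longer than right trials.
tokens = [100, 200, 400, 800]
flags = [1, 1, 0, 0]
print(d_wrong_minus_right(tokens, flags))
```

Because the statistic is computed on each agent's own log scale, the same function applies to human reaction times and to LRM token counts without putting the two on one axis.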

#### Difficulty controls\.

A positive d-ratio could in principle reflect a value-insensitive stopping rule or a difficulty effect that correlates with errors. To break the tautology we use two within-agent controls. (a) A leave-one-out ensemble-difficulty regression $\log t_i = \alpha + \beta_{\mathrm{correct}}\, c_i + \beta_{\mathrm{diff}}\, D_i^{(-A)} + \varepsilon_i$, where $D_i^{(-A)} = 1 - \bar{c}_{i,-A}$ and $\bar{c}_{i,-A}$ is the mean correctness on item $i$ taken over all non-focal agents available for that paradigm. When the focal agent $A$ is the human population, the non-focal agents are all thinking LRMs that attempted item $i$. When $A$ is one of the thinking LRMs, the non-focal agents are the human population plus every other thinking LRM that attempted item $i$. (b) Item fixed-effects regressions that absorb item identity with item dummies, identifying the correctness coefficient purely from variation within an item. The headline combined dissociation specification is $\log t_i = \alpha + \beta\, c_i + \beta_{\mathrm{LRM}}\, c_i \times \mathrm{LRM}_i + \delta_{\text{agent}} + \gamma_{\text{item}} + \varepsilon_i$, pooling humans and LRMs; the interaction $\beta_{\mathrm{LRM}}$ is the within-item dissociation. We additionally fit humans-only and LRMs-only item-FE specifications, per-LRM versions of the combined regression (humans plus one LRM at a time), the same three regressions on INTUIT and Cortes, and a Mundlak (Mundlak, [1978](https://arxiv.org/html/2606.26502#bib.bib32)) re-parameterisation that replaces item dummies with item-level mean correctness. Cluster-robust SE by item throughout. Full specifications and analysis scripts are in the OSF deposit.

*Identification scope.* The released LRM data contain exactly one observed outcome per (item $\times$ model) pair, so the within-item LRM correctness slope in the LRMs-pooled and per-LRM combined regressions is identified by variation across LRMs (and across the human population) on the same item, not by repeated stochastic samples of a single model. The interaction $\beta_{\mathrm{LRM}}$ is therefore an item-controlled human–LRM contrast, not a stochastic per-model slope; a stochastic single-model estimate would require multiple LRM samples per item, which the public release does not provide. Full specification rationale is in SI S7.
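The two controls can be sketched in a few lines of plain Python. This is an illustrative reduction under stated assumptions, not the released analysis script: the function names and row layouts are hypothetical, the human population is treated as a single pooled agent label, and the fixed-effects function handles only the single-regressor case (one agent group, no interaction term, no cluster-robust standard errors).

```python
from collections import defaultdict


def loo_difficulty(rows, focal_agent):
    """D_i^(-A) = 1 - mean correctness on item i over all non-focal agents.

    rows: iterable of (agent, item, correct) with correct in {0, 1}.
    Items with no non-focal observation are left out of the result.
    """
    by_item = defaultdict(list)
    for agent, item, correct in rows:
        if agent != focal_agent:
            by_item[item].append(correct)
    return {item: 1 - sum(cs) / len(cs) for item, cs in by_item.items()}


def within_item_slope(rows):
    """Item fixed-effects slope of log deliberation on correctness.

    rows: iterable of (item, log_t, correct). With a single regressor,
    demeaning both variables within item gives the same slope as
    including one dummy per item.
    """
    groups = defaultdict(list)
    for item, log_t, correct in rows:
        groups[item].append((log_t, correct))
    sxy = sxx = 0.0
    for obs in groups.values():
        mean_y = sum(y for y, _ in obs) / len(obs)
        mean_x = sum(x for _, x in obs) / len(obs)
        for y, x in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("no item has both right and wrong trials")
    return sxy / sxx


# Toy data: two items with different baselines; wrong trials are
# one log unit longer than right trials within each item.
trials = [("a", 5.0, 1), ("a", 6.0, 0), ("b", 8.0, 1), ("b", 9.0, 0)]
print(within_item_slope(trials))
```

Only items that contain both a right and a wrong trial contribute to the slope, which is why the identification-scope caveat about one observation per (item, model) pair matters for the LRM side.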

#### Hierarchical mixed\-effects test on H\-ARC\.

We fit $\log t \sim c_c + \mathrm{LRM} + c_c \times \mathrm{LRM}$ with centred correctness $c_c$, an LRM indicator, and an item random intercept. A crossed random-effects refit and two cluster-robust OLS specifications (clustered by agent and by item; Table S4) are reported as robustness anchors. The item fixed-effects estimate (cluster-robust SE by item) is the primary difficulty-controlled estimand; mixed-effects and cluster variants are sensitivity checks (SI S7). Participant identifiers are unavailable in the public release, so clustering is at the item level.

#### Trace\-content features\.

For each LRM trial on H-ARC and Cortes (INTUIT traces are “Restricted” in the public release) we compute four length-normalised features from the released `reasoning_trace`: self-doubt-marker density per 1,000 characters (`wait`, `actually`, `hmm`, `let me reconsider`, …), self-correction-marker density (`scratch that`, `ignore that`, …), word-level 5-gram repetition rate, and type–token ratio. For each feature we fit $\text{feature} \sim \text{correct} + \log w + C(\text{agent}) + C(\text{item})$, where $w$ is trace length in words, with cluster-robust SE by item. Full marker lists and the parser are in the OSF deposit (see Data and Code Availability).
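A rough sketch of three of these features follows. It is not the released parser: the marker tuple contains only the four examples quoted above, the word tokeniser is a simple regular expression, and the repetition rate is defined here, as an assumption, as the share of 5-grams that duplicate another 5-gram in the same trace.

```python
import re

# Illustrative subset only; the full marker lists are in the OSF deposit.
DOUBT_MARKERS = ("wait", "actually", "hmm", "let me reconsider")


def marker_density(trace, markers=DOUBT_MARKERS):
    """Whole-word marker occurrences per 1,000 characters of trace."""
    if not trace:
        return 0.0
    text = trace.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(m) + r"\b", text)) for m in markers
    )
    return 1000.0 * hits / len(trace)


def words(trace):
    """Lower-cased word tokens (assumed tokenisation)."""
    return re.findall(r"[a-z0-9']+", trace.lower())


def ngram_repetition_rate(trace, n=5):
    """Share of word n-grams that duplicate another n-gram in the trace."""
    toks = words(trace)
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def type_token_ratio(trace):
    """Distinct word types divided by total word tokens."""
    toks = words(trace)
    return len(set(toks)) / len(toks) if toks else 0.0


example = "Wait, actually the grid flips. Hmm, wait."
print(marker_density(example), type_token_ratio(example))
```

Raw type–token ratio falls as traces get longer, which is one reason the regression above includes $\log w$ as a covariate.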

#### Reproducibility and multiple comparisons\.

Analyses are deterministic with seed `0xA12C`; full code is at the OSF repository (see Data and Code Availability). The implicit family of tests and the family-wise correction are summarised in SI S8.

## 3 Results

On the same items, humans and thinking LRMs agree on which ones are hard. They disagree on what to do about it. The registration test reproduces the de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) result on H-ARC: thinking-LRM trace length correlates positively with human reaction time (Spearman $\rho = 0.16$–$0.30$, all $p < .001$ across the five matched LRMs), while non-thinking DeepSeek-V3 does not ($\rho = -0.05$). The same five LRMs separate cleanly from the matched human samples on the allocation axis: every thinking LRM lands at $d > 0$ while the human point sits at $d = -0.10$. Registration position is uninformative about allocation position (Figure [1](https://arxiv.org/html/2606.26502#S3.F1)b). Two agents passing the same registration test can fail the allocation test in opposite directions.
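For readers who want to reproduce the registration statistic on their own per-item aggregates, a dependency-free Spearman correlation can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be one human reaction-time summary and one LRM token count per matched item; how the paper aggregates human trials per item is not specified in this section.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-item values (not from the paper).
human_rt_seconds = [3.1, 5.0, 9.2, 12.5]
lrm_tokens = [200, 180, 900, 1500]
print(spearman_rho(human_rt_seconds, lrm_tokens))
```

Because the statistic depends only on ranks, it is unaffected by the units of either measure, which is what allows seconds and tokens to be related here without being placed on a common scale.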

![Refer to caption](https://arxiv.org/html/2606.26502v1/x1.png)

Figure 1: The two-level diagnostic. (a) Schematic. An agent registers item difficulty (a perceptual or evaluative difficulty signal) and then allocates deliberation around that registered difficulty (a stopping or scheduling rule that decides whether to keep thinking on the item now in front of it). The cross-item alignment of de Varda et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib4)) probes registration; the within-agent contrast introduced here probes allocation. (b) The two diagnostics on H-ARC. Horizontal axis: cross-item Spearman $\rho$ between LRM reasoning-token length and human reaction time (registration). Vertical axis: within-agent Cohen's $d$ on log deliberation, wrong $-$ right (allocation). The thinking LRMs (five plotted; gpt-oss-20b shown as an open circle, parser-limited) all sit in the upper-right quadrant (positive on both diagnostics); the matched human reference is at $d = -0.10$ (dashed); the non-thinking V3 baseline is in the lower-left. Registration position is not predictive of allocation position. The vertical contrast is interpreted within each agent's native deliberation scale; seconds and reasoning tokens are not compared in absolute magnitude.

### 3.1 The within-agent allocation gap on H-ARC

The H-ARC result is a sign reversal: humans spend less time on failures once item difficulty is controlled, whereas LRMs spend more tokens on failures. H-ARC carries the inferential weight: it has the largest matched LRM sample and the cleanest item-fixed primary test. The allocation gap on H-ARC has three layers, each shown as one panel of Figure [2](https://arxiv.org/html/2606.26502#S3.F2). (i) Per-agent allocation. Every analysed thinking LRM produces longer log deliberation on wrong than on right trials. The well-powered range is $d = 1.47$–$3.13$ (DeepSeek-R1, GLM-4.5-Air-FP8, gpt-oss-120b, Qwen-QwQ-32B; all $n_{\text{LRM}} \geq 298$), while the matched human sample is at $d = -0.10$ (Figure [2](https://arxiv.org/html/2606.26502#S3.F2)a; Table S1). Per-bar failure-budget amplification annotations show that the well-powered LRMs spend at least a proportional share of their total compute on the trials they fail (FBA $= 1.03$–$1.19$). (ii) The direction reverses across systems on the same paradigm. Estimated marginal mean log deliberation by outcome (Figure [2](https://arxiv.org/html/2606.26502#S3.F2)b) shows a small *negative* slope on the human side ($-0.08$ log units, wrong $-$ right) and a large *positive* slope on the LRM side ($+0.78$). Held on their own scales, the two systems disagree on the sign of the correctness slope. (iii) The dissociation survives item fixed effects and several mixed-effects and cluster-robust variants (Figure [2](https://arxiv.org/html/2606.26502#S3.F2)c). The primary item fixed-effects estimand absorbs item identity exactly and identifies the agent-type $\times$ correctness interaction purely from within-item variation: $\beta_{\mathrm{LRM}} = -0.66$ log units ($p < .001$, 95% CI $[-0.81, -0.50]$, $n = 5{,}728$). Four sensitivity specifications triangulate the same negative interaction, with point estimates between $-0.66$ and $-0.94$ and all 95% CIs strictly below zero: item random intercept, crossed agent $+$ item random effects, OLS with cluster-robust SE by item, and OLS with cluster-robust SE by agent (Figure [2](https://arxiv.org/html/2606.26502#S3.F2)c; Table S4). We treat the random-effects magnitude differences as expected shrinkage rather than substantive. A Mundlak (Mundlak, [1978](https://arxiv.org/html/2606.26502#bib.bib32)) re-parameterisation that replaces item dummies with item-level mean correctness produces a still larger dissociation ($\beta_{\mathrm{LRM}} = -0.79$, $p < .001$; within-item and between-item slopes differ at $p < .001$ in both agent groups; SI Table S2).
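Failure-budget amplification is defined in the Figure 2 caption as the share of LRM compute spent on wrong trials divided by the share of trials that are wrong. A minimal sketch of that ratio is given below; the function name is hypothetical, and using raw (not log) token counts as the measure of compute is an assumption.

```python
def failure_budget_amplification(tokens, correct):
    """Share of total tokens spent on wrong trials / share of trials wrong.

    tokens: per-trial token counts; correct: aligned 0/1 flags.
    A value of 1.0 means failures consume exactly their proportional share.
    """
    if len(tokens) != len(correct):
        raise ValueError("tokens and correct must have the same length")
    total = sum(tokens)
    n_wrong = sum(1 for c in correct if c == 0)
    if total <= 0 or n_wrong == 0:
        raise ValueError("need positive total tokens and at least one wrong trial")
    wrong_tokens = sum(t for t, c in zip(tokens, correct) if c == 0)
    compute_share = wrong_tokens / total
    trial_share = n_wrong / len(correct)
    return compute_share / trial_share


# Toy data: half the trials are wrong but they consume 80% of the tokens.
print(failure_budget_amplification([100, 100, 300, 500], [1, 1, 0, 0]))
```

The ratio is bounded above by one over the share of wrong trials, so a model that fails most items can show a very large d-ratio while its FBA stays close to 1.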

![Refer to caption](https://arxiv.org/html/2606.26502v1/x2.png)

Figure 2: The within-agent allocation gap on H-ARC. (a) Per-agent within-agent Cohen's $d$ on log deliberation (wrong $-$ right). The asterisk on gpt-oss-20b flags its parser-limited status; Qwen3-235B-Thinking is omitted here and reported in Table S1. Failure-budget amplification (FBA) is the share of LRM compute spent on wrong trials, divided by the share of trials that are wrong. (b) Estimated marginal mean log deliberation by outcome, with bootstrapped 95% CIs. Humans (left axis, log RT) descend on wrong trials; LRMs (right axis, log reasoning tokens) ascend. The two y-axes are scaled independently to align each system's within-agent change; cross-system magnitudes are not directly comparable. (c) Specification-robustness forest for the agent-type $\times$ correctness interaction; the diamond is the primary item-fixed-effects estimand (Table S4).

One feature of the LRM sample bears on the mechanism reading. The non-thinking DeepSeek-V3 baseline adds a useful constraint. V3 shows a positive d-ratio on Cortes ($d = 1.41$, 95% CI $[1.06, 1.84]$) and a directionally positive but underpowered d-ratio on H-ARC ($d = +0.52$, 95% bootstrap CI $[-0.22, +1.24]$, $n_{\text{right}} = 18$ at the 4.5% accuracy floor); V3 on INTUIT is too floored to estimate. V3 has no explicit reasoning trace and no thinking-targeted post-training. The V3 results therefore indicate that at least part of the wrong-trial length expansion in the thinking-LRM family is shared with non-thinking base-model generation under uncertainty, rather than being attributable exclusively to a reasoning-specific allocation policy. The data do not let us partition the share. The matched human samples on the same items nonetheless show no comparable wrong-trial expansion.

### 3.2 Generalisation and the Cortes boundary

We treat INTUIT as the cross\-paradigm replication test and Cortes as the boundary test\. The former asks whether the H\-ARC dissociation generalises to a structurally distinct multi\-choice reasoning paradigm; the latter asks how far the allocation diagnostic survives when the task has a binary relational answer space and few LRM errors\. INTUIT preserves the sign dissociation\. Cortes preserves the agent\-type contrast but reveals a boundary on the LRM\-only within\-item slope\.

The same allocation contrast on the other two non-saturated paradigms (Figure [3](https://arxiv.org/html/2606.26502#S3.F3)) extends the H-ARC pattern in two informative ways. INTUIT (intuitive physics, multi-choice) is the clean cross-paradigm replication. Every analysed thinking LRM lands at $d > 0$ ($d = 0.59$–$1.42$ across six models) while the matched human sample is at $d = +0.07$, and the item-FE agent-type interaction is $\beta_{\mathrm{LRM}} = -0.41$ ($p < .001$). Cortes (binary relational reasoning) is a paradigm-dependent boundary. On Cortes, the marginal d-ratio and the item-fixed-effect slope diverge: the d-ratio remains wrong-longer overall (every thinking LRM at $d = 0.82$–$1.96$), but once item identity is absorbed, the pooled LRM correctness slope becomes positive ($+0.31$, opposite in sign to its negative slopes on H-ARC and INTUIT). We therefore treat Cortes as a boundary case rather than as a clean replication of the LRM-side mechanism. The agent-type interaction is nevertheless preserved ($\beta_{\mathrm{LRM}} = -0.97$, $p < .001$), because the human side is more strongly negative on Cortes ($d = -0.31$). On Cortes both systems sit on the engagement side of zero on the within-item slope, and the dissociation survives only because the human positive within-item slope is steeper than the LRM one. We do not interpret the Cortes inversion mechanistically here: per-LRM wrong-trial counts are small (5–8 per model) and the answer space is structurally distinct (binary versus generative or multi-choice). Per-paradigm item-FE coefficients and the underlying per-LRM interactions are reported in SI Tables S5–S6; per-LRM Cortes bootstrap intervals (Efron and Tibshirani, [1993](https://arxiv.org/html/2606.26502#bib.bib31)) are wide but exclude zero for all six thinking LRMs (SI Figure S3).

At the per-LRM level the dissociation is rank-stable across paradigms: the same model that dissociates most strongly on H-ARC tends to dissociate most strongly on the other paradigms (Spearman $\rho = 0.77$–$0.89$ across the three paradigm pairs, $n = 6$ models per pair; Qwen-QwQ-32B is the strongest-dissociating model on every paradigm). With $n = 6$ this is descriptive, not inferential, but the direction is consistent with the dissociation reflecting a model-level (training-pipeline) signature rather than an item- or paradigm-specific artefact.

![Refer to caption](https://arxiv.org/html/2606.26502v1/x3.png)

Figure 3: Cross-paradigm allocation gap. Per-agent within-agent Cohen's $d$ on log deliberation (wrong $-$ right) for the two non-saturated paradigms beyond H-ARC. INTUIT (intuitive physics) is the clean replication; Cortes (binary relational reasoning) is a paradigm-dependent boundary in which the agent-type interaction is preserved but the LRM-only within-item slope reverses sign (see Generalisation and the Cortes boundary). H-ARC is the primary evidence and is shown in Figure [2](https://arxiv.org/html/2606.26502#S3.F2)a.

### 3.3 Inside the dissociation: human engagement, LRM length-on-uncertainty

A behavioural dissociation alone underdetermines mechanism. We use two convergent within-system probes (Figure [4](https://arxiv.org/html/2606.26502#S3.F4)) to sketch what is producing each side of the allocation gap: human time appears partly gated by engagement, whereas LRM length tracks uncertainty-like trace features. The labels “engagement-vs-abandonment” and “length-on-uncertainty” are interpretations of these behavioural signatures, not direct evidence of internal architecture.

The two human numbers, a near-zero H-ARC d-ratio ($d = -0.10$, slightly negative; Figure [2](https://arxiv.org/html/2606.26502#S3.F2)a,b) and a *positive* within-item slope ($+0.24$ log units), look at first like a contradiction but reflect different operations. The d-ratio mixes within- and between-item variance, so it inherits the cross-item difficulty confound that the within-agent contrast was designed to remove; once item identity is held fixed, the human slope is positive and reveals the engagement signature analysed below.

The positive within-item human slope on H-ARC ($+0.24$ log units) is not noise: it is the visible imprint of an engagement-vs-abandonment process separable from problem-solving itself. The H-ARC release preserves a per-trial action count (`Num_actions_attempt_1`, the number of grid edits the participant made), which decomposes the slope into engagement and disengagement components. Within the human wrong cell, the fast-wrong quartile shows a median of $16$ grid actions while the slow-wrong quartile shows $79$ ($5\times$ more); fast-wrong trials look like abandonment, slow-wrong trials like engaged-but-failed effort. The slope is concentrated on the harder items (bottom-quartile $\beta_{\mathrm{correct}}^{\mathrm{human}} = +0.32$, $p < .001$; top-quartile $+0.13$, $p = .05$), as one would predict if hard items disproportionately attract abandonment. Per-item RT coefficient of variation predicts per-item slope across $341$ items with both right and wrong trials (Spearman $\rho = +0.27$, $p = 5 \times 10^{-7}$), and adding $\log(1 + \text{actions})$ as a covariate cuts the within-item slope nearly in half, from $+0.24$ to $+0.12$ (Figure [4](https://arxiv.org/html/2606.26502#S3.F4)a). The Mundlak decomposition corroborates: the human between-item correctness slope is $-0.54$ (items more humans solve have shorter mean RT), opposite to the within-item slope, exactly as predicted under abandonment between items and engagement within them.
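A minimal sketch of the two decompositions named above, on synthetic data, may help fix ideas. Everything here is an assumption for illustration: the trials are generated inside the code, the coefficients $0.1$ and $0.2$ are arbitrary, and the helper names are hypothetical. The first function compares the raw within-item correctness slope with the slope after adding $\log(1 + \text{actions})$; the second computes the between-item slope on item means, as in a within-between (Mundlak-style) split.

```python
import math
from statistics import mean


def make_trials():
    """Synthetic trials: log RT = item effect + 0.1*correct + 0.2*log1p(actions)."""
    spec = [
        # (item, item_effect, correct, actions) -- all values invented
        ("A", 2.0, 1, 60), ("A", 2.0, 1, 80), ("A", 2.0, 0, 15), ("A", 2.0, 0, 70),
        ("B", 3.0, 1, 90), ("B", 3.0, 0, 10), ("B", 3.0, 0, 20), ("B", 3.0, 0, 75),
    ]
    return [(item, c, math.log1p(a), eff + 0.1 * c + 0.2 * math.log1p(a))
            for item, eff, c, a in spec]


def demean_by_item(rows):
    """Subtract each item's mean from correctness, log-actions and log RT."""
    groups = {}
    for item, c, g, y in rows:
        groups.setdefault(item, []).append((c, g, y))
    out = []
    for obs in groups.values():
        cb, gb, yb = (mean(col) for col in zip(*obs))
        out.extend((c - cb, g - gb, y - yb) for c, g, y in obs)
    return out


def within_slopes(rows):
    """Return (raw, adjusted) within-item correctness slopes."""
    d = demean_by_item(rows)
    sxx = sum(x * x for x, _, _ in d)
    sxg = sum(x * g for x, g, _ in d)
    sgg = sum(g * g for _, g, _ in d)
    sxy = sum(x * y for x, _, y in d)
    sgy = sum(g * y for _, g, y in d)
    raw = sxy / sxx
    # Two-regressor OLS on demeaned data, solved from the 2x2 normal equations.
    adjusted = (sxy * sgg - sgy * sxg) / (sxx * sgg - sxg ** 2)
    return raw, adjusted


def between_slope(rows):
    """Slope of item-mean log RT on item-mean correctness, weighted by cell size."""
    groups = {}
    for item, c, _, y in rows:
        groups.setdefault(item, []).append((c, y))
    cells = [(mean(c for c, _ in obs), mean(y for _, y in obs), len(obs))
             for obs in groups.values()]
    n = sum(w for _, _, w in cells)
    xb = sum(w * x for x, _, w in cells) / n
    yb = sum(w * y for _, y, w in cells) / n
    num = sum(w * (x - xb) * (y - yb) for x, y, w in cells)
    den = sum(w * (x - xb) ** 2 for x, _, w in cells)
    return num / den


rows = make_trials()
raw, adjusted = within_slopes(rows)
between = between_slope(rows)
print(f"raw within slope {raw:+.3f}, adjusted {adjusted:+.3f}, between {between:+.3f}")
```

Because the synthetic correct trials carry more grid actions, part of the raw within-item slope is absorbed once the action covariate enters, and the between-item slope has the opposite sign. This mirrors the qualitative pattern reported in the text, not its magnitudes.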

The LRM side leaves a complementary footprint in the trace itself. Holding trace length, item, and agent fixed, wrong-trial chains on H-ARC contain $0.43$ more self-doubt / hedging markers per $1{,}000$ characters than right-trial chains ($\beta_{\mathrm{correct}} = -0.435$, $p = 7 \times 10^{-4}$; Figure [4](https://arxiv.org/html/2606.26502#S3.F4)b). The direction is the same in $5/5$ thinking models. On Cortes (binary action space, fewer opportunities for verbal hedging) the asymmetry surfaces in lexical structure rather than in lexical hedging: wrong-trial Cortes chains show $7.4\%$ higher $5$-gram repetition ($\beta = -0.074$, $p = 7 \times 10^{-8}$) and $4.1$ percentage-points lower type–token ratio ($\beta = +0.041$, $p = 10^{-3}$), with the predicted direction in $6/6$ Cortes models. The two paradigms thus carry convergent rather than identical signatures: hedging surfaces in language-rich generative output, lexical contraction under a constrained binary action space. Both are consistent with a shared length-on-uncertainty mechanism. Surface markers are descriptive proxies, not mechanism; controlled truncation is the discriminating intervention (see Discussion; full regression in SI Table S7).
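The three trace features can be sketched as simple text statistics. This is a hedged illustration, not the paper's feature extractor: the marker lexicon `HEDGE_MARKERS`, the whitespace tokenisation, and the two example traces are all assumptions introduced here, and the actual definitions are in the paper's SI.

```python
import re

# Hypothetical hedging lexicon; the paper's actual marker list is not reproduced here.
HEDGE_MARKERS = ("wait", "hmm", "not sure", "maybe", "actually")


def hedging_per_1k_chars(text, markers=HEDGE_MARKERS):
    """Count whole-word marker hits, normalised per 1,000 characters of trace."""
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(m) + r"\b", lowered)) for m in markers)
    return 1000.0 * hits / max(len(text), 1)


def ngram_repetition(tokens, n=5):
    """Share of n-grams that duplicate an n-gram already present in the trace."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented example traces, for illustration only.
right_trace = "The grid is mirrored left to right so the answer is the flipped grid"
wrong_trace = ("Wait maybe the grid is mirrored hmm not sure maybe the grid is mirrored "
               "maybe the grid is mirrored again")

for name, trace in (("right", right_trace), ("wrong", wrong_trace)):
    toks = trace.lower().split()
    print(name,
          round(hedging_per_1k_chars(trace), 1),
          round(ngram_repetition(toks), 3),
          round(type_token_ratio(toks), 3))
```

Normalising hedging by characters and entering trace length as a covariate are separate steps; the sketch covers only the per-trace features, not the length-controlled regression on correctness.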

![Refer to caption](https://arxiv.org/html/2606.26502v1/x4.png)

Figure 4: Two convergent behavioural probes of the proposed mechanism. The labels are mechanistic interpretations of behavioural signatures, not direct evidence of internal architecture. (a) Human engagement decomposition. The raw within-item correctness slope on H-ARC human log RT is $\beta = +0.241$; controlling for $\log(1 + \text{actions})$ (the per-trial grid-action count) reduces it to $\beta = +0.125$, a $48\%$ reduction. (b) LRM trace-content asymmetries: length-controlled coefficients on correctness for three features. Negative $\beta$ on hedging or repetition means the feature is higher on wrong trials.

### 3.4 Robustness

The H-ARC allocation gap survives three further robustness checks (SI S12, Figure S2): a positive LRM d-ratio in nearly every difficulty quartile, an item-level human d-ratio of $+0.60$ that recreates the cross-item difficulty confound the within-agent contrast avoids, and an aggregated-cell re-estimation that returns the same $\beta_{\mathrm{LRM}} = -0.66$ (SI S4). The Supplementary Information is organised to separate inferential robustness checks from descriptive low-power cells: per-LRM H-ARC estimates (S1), aggregation against within-item human pseudo-replication (S4), specification variants (S9), multiple-comparison treatment (S8), and cross-paradigm item-FE regressions (S10) are reported separately so readers can audit the primary estimand without being asked to over-interpret descriptive cells.

## 4 Discussion

### 4.1 What the diagnostic reveals

The matched data show that registration and allocation can come apart: humans and thinking LRMs grade items by difficulty in the same order, yet allocate computation around that registered difficulty by different rules. A mechanism-level claim of shared reasoning is therefore not licensed by cross-item alignment alone. What follows is interpretive: behavioural dissociations under-determine mechanism, and the allocation patterns are consistent with the principles below, not proof of them.

On the human side, the within-item positive slope and its action-count and Mundlak decompositions (see Inside the dissociation) point to an engagement-vs-abandonment schedule: a metacognitive evaluation of whether the item is worth pursuing, under which trials that end in success show sustained engagement while fast errors look like abandonment. The sign flip between the within-item and between-item human slopes is the signature such a schedule predicts.

On the LRM side, the within-agent allocation gap on H-ARC, its replication across paradigms, and the trace-content asymmetries reported in Inside the dissociation are consistent with a length-on-uncertainty policy: when the model is uncertain it generates more reasoning tokens, with an uncertain relation to whether those tokens carry task-relevant computational value. Surface markers do not establish that the additional tokens lack computational value; they motivate an intervention. Controlled truncation is needed to separate useful late search from length expansion that continues after the answer has effectively stabilised.

The non-thinking DeepSeek-V3 control (see The within-agent allocation gap on H-ARC) leaves two readings open: a base-model reading, on which the wrong-trial length expansion is a general property of next-token generation that thinking LRMs inherit, and a stratified reading, on which thinking LRMs add a reasoning-targeted component picked out by the trace-content asymmetries. The truncation experiment of How to test this further is the first cut at separating them. Either way the cognitive-science claim is robust: the human samples on the same items show no comparable wrong-trial expansion, and the two systems diverge not in whether difficulty registers but in the rule mapping registered difficulty to deliberation time. The next subsection makes that rule-level divergence formal.

### 4.2 A resource-rational synthesis

We read the dissociation through the lens of resource-rational analysis (Lieder and Griffiths, [2020](https://arxiv.org/html/2606.26502#bib.bib1); Gershman et al., [2015](https://arxiv.org/html/2606.26502#bib.bib30); Callaway et al., [2022](https://arxiv.org/html/2606.26502#bib.bib33)) and its metareasoning antecedent (Russell and Wefald, [1991](https://arxiv.org/html/2606.26502#bib.bib2)). Let $\mathrm{VOC}(t \mid \mathrm{item}) = \mathbb{E}[\,\Delta\mathrm{accuracy} \mid t\text{ more units}\,] - \kappa\,t$ denote the value of computation for $t$ additional log-token units on a given item, with $\kappa > 0$ a monotone cost. The resource-rational stopping rule is to stop when marginal $\mathrm{VOC} \leq 0$. Computation earns its cost only where it can still move the answer: resource-rational belief updating makes this premise explicit, with additional compute carrying beliefs toward the Bayes-optimal posterior in proportion to the uncertainty that compute can still reduce (Zhu and Griffiths, [2025](https://arxiv.org/html/2606.26502#bib.bib36)). This rule does not predict that harder items always take longer: as item difficulty grows, the agent should engage harder when the marginal accuracy gain decays slowly (positive expected return on more $t$) and abandon when it vanishes (no return to be had). The within-item, outcome-conditioned slope is exactly the observable that separates these two regimes on the same item.
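The stopping rule can be illustrated with a small simulation. The sketch below rests on assumptions that are not in the paper: expected accuracy is taken to follow a saturating curve `ceiling * (1 - exp(-rate * t))`, computation proceeds in unit steps, and all parameter values are invented. It only shows that a marginal-VOC rule produces engagement on hard-but-solvable items and immediate abandonment on items with almost no attainable gain.

```python
import math


def expected_accuracy(t, ceiling, rate):
    """Assumed saturating accuracy curve after t units of computation."""
    return ceiling * (1.0 - math.exp(-rate * t))


def stopping_time(ceiling, rate, kappa, max_units=1000):
    """First t at which one more unit no longer pays for itself (marginal VOC <= 0)."""
    for t in range(max_units):
        marginal_gain = (expected_accuracy(t + 1, ceiling, rate)
                         - expected_accuracy(t, ceiling, rate))
        if marginal_gain - kappa <= 0:
            return t
    return max_units


KAPPA = 0.005  # hypothetical per-unit cost

# Hypothetical item profiles: (ceiling on attainable accuracy, rate of approach).
stop_easy = stopping_time(ceiling=0.95, rate=0.5, kappa=KAPPA)         # quick to solve
stop_hard = stopping_time(ceiling=0.90, rate=0.05, kappa=KAPPA)        # slow but solvable
stop_unreachable = stopping_time(ceiling=0.05, rate=0.05, kappa=KAPPA)  # little to gain

print("easy:", stop_easy, "hard-solvable:", stop_hard, "unreachable:", stop_unreachable)
```

Under these assumptions deliberation is not monotone in difficulty: the hard-but-solvable item receives the most computation, while the item with the lowest attainable accuracy receives none.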

Under this lens, the human pattern (longer $t$ on solved trials, short $t$ on fast-wrong trials, with the engagement-vs-abandonment gap on slow-wrong trials) is consistent with the resource-rational rule under any monotone $\kappa$: extra $t$ is allocated where expected marginal accuracy gain remains positive and withdrawn where it does not, and the agent’s own subsequent solve/no-solve outcome is an informative (if noisy) revealed proxy for that expectation. The thinking-LRM pattern inverts this revealed mapping: extra $t$ is allocated ex ante to items on which the model’s own continuation behaviour ultimately reveals near-zero return. Reconciling this with a resource-rational stopping rule under any monotone $\kappa$ requires one of two things. Either the agent’s expected-gain forecast fails to condition on success-relevant features of the item, spending compute as if the remaining uncertainty were reducible by a longer chain when on these items it is closer to irreducible (Der Kiureghian and Ditlevsen, [2009](https://arxiv.org/html/2606.26502#bib.bib37)), or the objective being optimised is not item accuracy. Item fixed effects absorb shared item difficulty but not agent-specific affordances or model-specific calibration failures. We therefore cannot formally rule out the first horn; the data shift interpretive weight toward the second without excluding the first. The empirical contrast is thus not a direct estimate of VOC, but a sign constraint on the mapping from registered difficulty to observable deliberation. On Cortes both systems land in the engagement regime (positive within-item slopes), so the VOC reading there is weaker and the dissociation survives only because the human positive slope is steeper than the LRM one.
LRM behaviour may well be optimal under a different objective (RL training signals, verifier rewards, length-conditioned preferences) (Hu, [2026](https://arxiv.org/html/2606.26502#bib.bib9); Samineni et al., [2025](https://arxiv.org/html/2606.26502#bib.bib18)); the right reading is not “humans rational, LRMs irrational” but that the two systems are not implementing the same resource-rational deliberation policy even if each is sensible under its own objective. This is the formal version of the claim, made in Why this matters below, that a single “deliberation” construct papers over a real distinction.

### 4.3 Why this matters

The first consequence is methodological\. The two\-level diagnostic should be the default in future LRM\-as\-cognitive\-model work: registration alone is not a mechanism claim, and any cross\-item alignment argument should be paired with the within\-agent allocation contrast on the same items\. In its standard cognitive\-science usage, “cost of thinking” encompasses both how an agent grades items by difficulty and how it allocates computation around that grading; the sign reversal documented here at the second level shows that cross\-item alignment alone does not license that framing in its full sense\.

The second consequence is metacognitive, and it is the most substantive cognitive-science claim the dissociation supports. The metacognition literature distinguishes second-order monitoring (verbalised uncertainty, hedging, confidence) from second-order control that acts on the first-order computation itself, terminating it when its expected value is low (Nelson and Narens, [1990](https://arxiv.org/html/2606.26502#bib.bib27); Yeung and Summerfield, [2012](https://arxiv.org/html/2606.26502#bib.bib28); Nelson and Leonesio, [1988](https://arxiv.org/html/2606.26502#bib.bib35)). The matched dissociation suggests these two components are decoupled in current thinking LRMs in a way they are not in the matched human samples. On the LRM side, monitoring signatures are present and even rise on wrong trials (hedging on H-ARC, lexical contraction on Cortes), but they appear weakly coupled to stopping: the model continues at length on the trials it ultimately fails. On the human side, a fast-wrong abandonment cell coexists with a slow-wrong engaged-but-failed cell on the same items, the hallmark of monitoring that does gate effort allocation. A single “deliberates longer when uncertain” description is therefore true of both systems at the cross-item level but misleading at the within-item level. The same surface phenomenon is consistent with a coupled monitor-controller in one case and with weaker coupling between uncertainty markers and stopping in the other. Whether thinking LRMs have a separable stopping circuit at all, or whether their “stopping” is the autoregressive decay of generation, becomes the architectural question the data sharpen but cannot settle alone. The truncation experiment is the natural follow-up.

### 4.4 How to test this further

The clearest discriminating prediction is intervention. This converges with the call by Hu ([2026](https://arxiv.org/html/2606.26502#bib.bib9)) for inference-time truncation as a diagnostic to distinguish performative scaffolding from incremental computation. It also follows the original authors’ own reply to Vankov et al., which reports that on H-ARC, the single non-ceiling task, manipulating reasoning effort has a large accuracy effect (de Varda et al., [2026b](https://arxiv.org/html/2606.26502#bib.bib5)): extra reasoning is not globally inert on this paradigm, which is exactly why an item-controlled, outcome-conditioned truncation test, rather than a global effort manipulation, is needed to ask whether the wrong-trial expansion specifically is load-bearing. A controlled truncation manipulation, with repeated LRM samples at fixed token budgets $f \in (0, 1)$ on the same items where we observe the allocation gap, recovering $p(\mathrm{correct} \mid t, \mathrm{item})$, produces qualitatively different curves under the two principles (Figure [5](https://arxiv.org/html/2606.26502#S4.F5)). A length-on-uncertainty pattern would converge on its answer well before the trace ends, predicting a high-accuracy plateau under moderate truncation; a useful-search interpretation predicts a late, steep rise as truncation cuts into load-bearing computation. The two predictions separate most strongly in the $f \approx 0.5$–$0.8$ range. In the terms of the synthesis above, this is a direct test of whether the uncertainty driving the wrong-trial expansion is the reducible kind that added computation can resolve or not. Alternative accounts remain open (a value-of-computation account under a non-task-aligned objective could still fit; SI S5), but truncation discriminates them at the level of the data they predict.
The present paper establishes the dissociation empirically and identifies truncation as the discriminating intervention; running it is left to follow-up.
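The two predicted curve shapes can be sketched with a pair of logistic functions. This is an illustrative simulation in the same spirit as the paper's Figure 5, not a reconstruction of it: the logistic form, the midpoints, the ceiling, and the reading of $f$ as the fraction of the full trace retained are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_correct(f, midpoint, ceiling=0.8, width=0.05):
    """Hypothetical accuracy when the trace is cut at fraction f of its full length."""
    return ceiling * sigmoid((f - midpoint) / width)


# Budget grid f = 0.05, 0.10, ..., 0.95.
fractions = [i / 20 for i in range(1, 20)]

# Length-on-uncertainty: the answer stabilises early, so accuracy plateaus early.
padding = [p_correct(f, midpoint=0.3) for f in fractions]
# Useful search: late tokens are load-bearing, so accuracy rises late and steeply.
search = [p_correct(f, midpoint=0.9) for f in fractions]

gaps = [a - b for a, b in zip(padding, search)]
best_f = fractions[gaps.index(max(gaps))]

for f, a, b in zip(fractions, padding, search):
    print(f"f={f:.2f}  padding={a:.3f}  search={b:.3f}")
print("largest separation at f =", best_f)
```

With these invented parameters the curves separate most at an intermediate budget, which is why a truncation design would concentrate its samples there rather than at the extremes.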

![Refer to caption](https://arxiv.org/html/2606.26502v1/x5.png)

Figure 5: Illustrative truncation curves under two candidate principles (illustrative simulation, not a fit). A useful-search interpretation predicts a late, steep accuracy rise as the truncation budget grows; a length-on-uncertainty / padding interpretation predicts an earlier rise and a broad high-accuracy plateau under moderate truncation. The curves diverge most strongly between $f = 0.5$ and $f = 0.8$.

### 4.5 Limitations

Six limitations constrain what we can conclude from the data. First, causal value of the extra LRM tokens. The data are observational. We do not show that the extra wrong-trial tokens have zero or low causal value on those particular items; we show that their systematic allocation is the opposite of what the matched human samples do on the same items. This limitation is exactly why the truncation experiment is framed as the discriminating follow-up (see How to test this further), rather than as evidence already supplied by the present analyses. Because the public release contains a single observation per (item $\times$ model) pair (see Methods scope note), we cannot estimate a per-model $p(\mathrm{correct} \mid t, \mathrm{item})$ function; the outcome-conditioned within-agent contrast is the cleanest probe the data afford. Until the truncation experiment is run, claims about the computational value of the LRM tokens are interpretive. Second, LRM scope. The sample is six open-weight thinking LRMs from four organisations. Closed-frontier reasoning systems (e.g. o-series, Gemini-style frontier models) are not tested; the per-LRM rank-stability we report is descriptive at $n = 6$ and may not extend. Third, human scope and identifiers. “Humans” here means the human populations reported in the source release (LeGris et al., [2025](https://arxiv.org/html/2606.26502#bib.bib10); Prunty et al., [2025](https://arxiv.org/html/2606.26502#bib.bib11); de Varda et al., [2025](https://arxiv.org/html/2606.26502#bib.bib4)), with their recruitment and selection criteria.
The public release does not retain participant identifiers; participant-level differences in speed, ability, and engagement therefore cannot be separated from trial-level engagement, and the engagement-vs-abandonment reading is consistent with the action-count and Mundlak evidence rather than a definitive demonstration that humans implement a particular stopping rule. Fourth, Cortes is a boundary case. The LRM-only within-item slope on Cortes is positive ($+0.31$), opposite to H-ARC and INTUIT; the agent-type dissociation survives because humans show an even stronger positive slope, but the LRM unilateral within-item sign is paradigm-dependent. Fifth, measurement. LRM reasoning-token length and human RT are analogous, not interchangeable. Reasoning traces are not guaranteed reports of latent computation (Valmeekam et al., [2025](https://arxiv.org/html/2606.26502#bib.bib19); Kambhampati et al., [2025](https://arxiv.org/html/2606.26502#bib.bib20); Samineni et al., [2025](https://arxiv.org/html/2606.26502#bib.bib18)); human RT decomposes into many sub-processes (Ratcliff and McKoon, [2008](https://arxiv.org/html/2606.26502#bib.bib21); Heitz, [2014](https://arxiv.org/html/2606.26502#bib.bib22); Bogacz et al., [2006](https://arxiv.org/html/2606.26502#bib.bib3)). The within-agent contrast requires only outcome-conditioned structure in each system’s own units; cross-system magnitude comparisons are not interpretable. Sixth, architecture. Behavioural dissociations constrain mechanism but cannot establish architecture. The length-on-uncertainty / engagement-vs-abandonment reading is a hypothesis the data are consistent with, not a claim the data prove.

## Acknowledgements

We thank the authors of the de Varda et al\. release for making the LRM and human reasoning data public, and the H\-ARC, INTUIT, and Cortes data creators for the underlying behavioural datasets\. No compute or data redistribution was required from those teams; the present analyses use only their public deposits\.

## Declarations

#### Funding\.

The author received no specific funding for this work\.

#### Competing Interests\.

The author declares no competing interests\.

#### Author Contributions\.

H\. Wang is the sole author of this manuscript and is solely responsible for the conception of the study, the design and implementation of the analyses, the interpretation of the results, and the writing of the manuscript\.

#### Ethics Approval\.

This study did not involve any new collection of data from human or animal participants\. All analyses are secondary re\-analyses of publicly released, fully de\-identified behavioural and large reasoning model output corpora \(the de Varda et al\. release, H\-ARC, INTUIT, and Cortes\); no identifiable personal data are accessed at any point in the pipeline\. Under the institutional research\-ethics policy of The University of Hong Kong and the Common Rule \(45 CFR 46\.104\(d\)\(4\)\), secondary analysis of publicly available de\-identified data is exempt from ethics\-board review\. No further ethics approval was therefore required or sought\.

#### Consent to Participate\.

Not applicable\. No new human participants were recruited or tested for this study; the analyses use only publicly released, de\-identified secondary data, for which the original data collectors obtained the relevant participant consent\.

#### Consent to Publish\.

Not applicable\. The manuscript contains no individually identifiable data, images, or other personal information from any human participant\.

#### Data and Code Availability\.

All behavioural data analysed in this study are public and are not redistributed by the present manuscript. The de Varda et al. release is at [https://osf.io/r3kum/](https://osf.io/r3kum/); H-ARC is accompanied by LeGris et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib10)); INTUIT is accompanied by Prunty et al. ([2025](https://arxiv.org/html/2606.26502#bib.bib11)); Cortes is redistributed in the de Varda deposit. The full analysis pipeline (data acquisition, scoring, statistical analyses, and figure generation) and intermediate results are openly available at the OSF repository [osf.io/jxb4a](https://osf.io/jxb4a/). The repository ships pinned dependencies, a deterministic random seed (0xA12C), and a Makefile target that reproduces every result end-to-end from the public source data.

## References

- M. Binz and E. Schulz (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences 120(6), e2218523120. [doi:10.1073/pnas.2218523120](https://dx.doi.org/10.1073/pnas.2218523120)
- R. Bogacz, E. Brown, J. Moehlis, P. Holmes, and J. D. Cohen (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review 113(4), pp. 700–765. [doi:10.1037/0033-295X.113.4.700](https://dx.doi.org/10.1037/0033-295X.113.4.700)
- F. Callaway, B. van Opheusden, S. Gul, P. Das, P. M. Krueger, T. L. Griffiths, and F. Lieder (2022). Rational use of cognitive resources in human planning. Nature Human Behaviour 6(8), pp. 1112–1125. [doi:10.1038/s41562-022-01332-8](https://dx.doi.org/10.1038/s41562-022-01332-8)
- F. Chollet (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547. [doi:10.48550/arXiv.1911.01547](https://dx.doi.org/10.48550/arXiv.1911.01547)
- R. A. Cortés, A. B. Weinberger, G. A. Colaizzi, G. F. Porter, E. L. Dyke, H. O. Keaton, D. L. Walker, and A. E. Green (2021). What makes mental modeling difficult? Normative data for the multidimensional relational reasoning task. Frontiers in Psychology 12, 668256. [doi:10.3389/fpsyg.2021.668256](https://dx.doi.org/10.3389/fpsyg.2021.668256)
- A. G. de Varda, F. P. D’Elia, H. Kean, A. Lampinen, and E. Fedorenko (2025). The cost of thinking is similar between large reasoning models and humans. Proceedings of the National Academy of Sciences 122(47), e2520077122. [doi:10.1073/pnas.2520077122](https://dx.doi.org/10.1073/pnas.2520077122)
- A. G. de Varda, F. P. D’Elia, H. Kean, A. Lampinen, and E. Fedorenko (2026a). Reply to Dujmović: The alignment in cost between human and model reasoning is an empirical phenomenon worth explaining. Proceedings of the National Academy of Sciences 123(4), e2536153123. [doi:10.1073/pnas.2536153123](https://dx.doi.org/10.1073/pnas.2536153123)
- A. G. de Varda, F. P. D’Elia, H. Kean, A. Lampinen, and E. Fedorenko (2026b). Reply to Vankov et al.: Reasoning traces are linked to accuracy and capture key dimensions of problem complexity. Proceedings of the National Academy of Sciences 123(12), e2603574123. [doi:10.1073/pnas.2603574123](https://dx.doi.org/10.1073/pnas.2603574123)
- DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. [doi:10.48550/arXiv.2412.19437](https://dx.doi.org/10.48550/arXiv.2412.19437)
- A. Der Kiureghian and O. Ditlevsen (2009). Aleatory or epistemic? Does it matter? Structural Safety 31(2), pp. 105–112. [doi:10.1016/j.strusafe.2008.06.020](https://dx.doi.org/10.1016/j.strusafe.2008.06.020)
- M. Dujmović (2026). No deep insights into the alignment between human and deep learning reasoning processes: Thoughts on de Varda et al. (2025). Proceedings of the National Academy of Sciences 123(4), e2533685123. [doi:10.1073/pnas.2533685123](https://dx.doi.org/10.1073/pnas.2533685123)
- B. Efron and R. J. Tibshirani (1993). An introduction to the bootstrap. Chapman & Hall. ISBN 978-0412042317.
- S. J. Gershman, E. J. Horvitz, and J. B. Tenenbaum (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349(6245), pp. 273–278. [doi:10.1126/science.aac6076](https://dx.doi.org/10.1126/science.aac6076)
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, others, and Z. Zhang (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), pp. 633–638. [doi:10.1038/s41586-025-09422-z](https://dx.doi.org/10.1038/s41586-025-09422-z)
- R. P. Heitz (2014). The speed-accuracy tradeoff: History, physiology, methodology, and behavior. Frontiers in Neuroscience 8, 150. [doi:10.3389/fnins.2014.00150](https://dx.doi.org/10.3389/fnins.2014.00150)
- Y. Hu (2026). “Thinking traces” in large reasoning models: Cognitive cost or performative scaffolding? Proceedings of the National Academy of Sciences 123(17), e2604554123. [doi:10.1073/pnas.2604554123](https://dx.doi.org/10.1073/pnas.2604554123)
- S. Kambhampati, K. Valmeekam, S. Bhambri, V. Palod, L. P. Saldyt, K. Stechly, S. R. Samineni, D. Kalwar, and U. Biswas (2025). Position: Stop anthropomorphizing intermediate tokens as reasoning / thinking traces! arXiv preprint arXiv:2504.09762. [doi:10.48550/arXiv.2504.09762](https://dx.doi.org/10.48550/arXiv.2504.09762)
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, pp. 22199–22213. [doi:10.48550/arXiv.2205.11916](https://dx.doi.org/10.48550/arXiv.2205.11916)
- S. LeGris, W. K. Vong, B. M. Lake, and T. M. Gureckis (2025). A comprehensive behavioral dataset for the Abstraction and Reasoning Corpus. Scientific Data 12, Article 1380. [doi:10.1038/s41597-025-05687-1](https://dx.doi.org/10.1038/s41597-025-05687-1)
- F. Lieder and T. L. Griffiths (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences 43, e1. [doi:10.1017/S0140525X1900061X](https://dx.doi.org/10.1017/S0140525X1900061X)
- Y. Mundlak (1978). On the pooling of time series and cross section data. Econometrica 46(1), pp. 69–85. [doi:10.2307/1913646](https://dx.doi.org/10.2307/1913646)
- T. O. Nelson and R. J. Leonesio (1988). Allocation of self-paced study time and the “labor-in-vain effect”. Journal of Experimental Psychology: Learning, Memory, and Cognition 14(4), pp. 676–686. [doi:10.1037/0278-7393.14.4.676](https://dx.doi.org/10.1037/0278-7393.14.4.676)
- T. O. Nelson and L. Narens (1990). Metamemory: A theoretical framework and new findings. In Psychology of Learning and Motivation, G. H. Bower (Ed.), Vol. 26, pp. 125–173. [doi:10.1016/S0079-7421(08)60053-5](https://dx.doi.org/10.1016/S0079-7421%2808%2960053-5)
- OpenAI (2025). gpt-oss-20b and gpt-oss-120b \[model cards\]. Hugging Face. [https://openai.com/index/introducing-gpt-oss/](https://openai.com/index/introducing-gpt-oss/)
- J. Prunty, A. O’Flynn, P. Quinn, and L. G. Cheke (2025). INTUIT: Investigating intuitive reasoning in humans and language models. In Proceedings of the 47th Annual Meeting of the Cognitive Science Society. [https://escholarship.org/uc/item/33z8g5dn](https://escholarship.org/uc/item/33z8g5dn)
- Qwen Team (2025). QwQ-32B and Qwen3-235B-A22B-Thinking-2507 \[model cards\]. Hugging Face. [https://huggingface.co/Qwen](https://huggingface.co/Qwen)
- R. Ratcliff and G. McKoon (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural Computation 20(4), pp. 873–922. [doi:10.1162/neco.2008.12-06-420](https://dx.doi.org/10.1162/neco.2008.12-06-420)
- S. Russell and E. Wefald (1991). Principles of metareasoning. Artificial Intelligence 49(1–3), pp. 361–395. [doi:10.1016/0004-3702(91)90015-C](https://dx.doi.org/10.1016/0004-3702%2891%2990015-C)
- S. R. Samineni, D. Kalwar, V. Gangal, S. Bhambri, and S. Kambhampati (2025). Local coherence or global validity? Investigating RLVR traces in math domains. arXiv preprint arXiv:2510.18176. [doi:10.48550/arXiv.2510.18176](https://dx.doi.org/10.48550/arXiv.2510.18176)
- H. A. Simon (1956). Rational choice and the structure of the environment. Psychological Review 63(2), pp. 129–138. [doi:10.1037/h0042769](https://dx.doi.org/10.1037/h0042769)
- C. Snell, J. Lee, K. Xu, and A. Kumar (2025). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. The Thirteenth International Conference on Learning Representations (ICLR 2025). [doi:10.48550/arXiv.2408.03314](https://dx.doi.org/10.48550/arXiv.2408.03314)
- K. Valmeekam, K. Stechly, V. Palod, A. Gundawar, and S. Kambhampati (2025). Beyond semantics: The unreasonable effectiveness of reasonless intermediate tokens. NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning (LAW 2025). [doi:10.48550/arXiv.2505.13775](https://dx.doi.org/10.48550/arXiv.2505.13775)
- I. I. Vankov, F. Adolfi, R. F. Heaton, G. Puebla, and J. S. Bowers (2026). Correlations without causation do not support claims of human–LLM reasoning alignment. Proceedings of the National Academy of Sciences 123(12), e2536362123. [doi:10.1073/pnas.2536362123](https://dx.doi.org/10.1073/pnas.2536362123)
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2201.11903)Cited by:[§1](https://arxiv.org/html/2606.26502#S1.p2.1)\.
- N\. Yeung and C\. Summerfield \(2012\)Metacognition in human decision\-making: Confidence and error monitoring\.Philosophical Transactions of the Royal Society B: Biological Sciences367\(1594\),pp\. 1310–1321\.External Links:[Document](https://dx.doi.org/10.1098/rstb.2011.0416)Cited by:[§1](https://arxiv.org/html/2606.26502#S1.p1.1),[§4\.3](https://arxiv.org/html/2606.26502#S4.SS3.p2.1)\.
- Z\.ai \(2025\)GLM\-4\.5\-Air\-FP8 \[model card\]\.Note:Hugging FaceExternal Links:[Link](https://huggingface.co/zai-org/GLM-4.5-Air-FP8)Cited by:[§2](https://arxiv.org/html/2606.26502#S2.SS0.SSS0.Px1.p1.4)\.
- J\. Zhu and T\. L\. Griffiths \(2025\)Computation\-limited Bayesian updating: a resource\-rational analysis of approximate Bayesian inference\.Psychological Review\.Note:Advance online publicationCited by:[§4\.2](https://arxiv.org/html/2606.26502#S4.SS2.p1.5)\.

## Supplementary Information

#### Contents\.

The twelve supplementary sections below extend, robustness\-test, or document the choices behind the four main\-text figures\. They are grouped here for navigation:

- Per-LRM and within–between decompositions of the H-ARC dissociation: S1, per-LRM item-FE estimates; S2, Mundlak within/between decomposition; S3, difficulty-controlled per-agent forest plot; S4, aggregated-cell robustness against within-item human pseudo-replication.
- Mechanism and intervention companions: S5, simulation details for the toy-model truncation curves; S11, length-controlled trace-content regressions on H-ARC and Cortes (the source for Figure 4b in the main text).
- Data provenance and statistical-specification choices: S6, LRM data provenance, dedup, and trace-counting conventions; S7, primary-estimand and identification-scope rationale for the item-FE specification; S8, implicit family of tests and multiple-comparison treatment.
- Specification-robustness tables and cross-paradigm extensions: S9, H-ARC dissociation across five specifications (numerical companion to Figure 2c in the main text); S10, cross-paradigm item-FE regressions and per-LRM dissociation across H-ARC, INTUIT, and Cortes; S12, robustness checks for the H-ARC allocation gap and Cortes per-LRM bootstrap intervals.

### S1\. Per\-LRM item fixed\-effects dissociation on H\-ARC

For each thinking LRM, the combined dissociation regression (iii) of *The within-agent allocation gap on H-ARC* is fit on humans plus that single model. The four well-powered LRMs (DeepSeek-R1, GLM-4.5-Air-FP8, gpt-oss-120b, Qwen-QwQ-32B; all $n_{\text{LRM}} \geq 298$) all return negative interactions with comfortable significance. gpt-oss-20b ($n = 119$, parser-limited) is also negative but reported with the parser caveat. The Qwen3-235B-Thinking H-ARC estimate is listed for completeness but should be treated as descriptive because it is identified from only four wrong trials; we do not read its $p$-value as inferential evidence.

The per-LRM Cohen’s $d$ for the four well-powered cells (in descending order: Qwen-QwQ-32B $3.13$, GLM-4.5-Air-FP8 $2.27$, DeepSeek-R1 $1.81$, gpt-oss-120b $1.47$; abstract range $[1.47, 3.13]$) and the parser-limited gpt-oss-20b ($d = 0.95$, widening the H-ARC range to $[0.95, 3.13]$) are visualised in Figure [2](https://arxiv.org/html/2606.26502#S3.F2)a; Qwen3-235B-Thinking ($n_{\text{wrong}} = 4$) is not estimated.

Table S1: H-ARC per-LRM item-FE dissociation (*The within-agent allocation gap on H-ARC*). $\beta_{\mathrm{LRM}}$ is the within-item correctness-slope difference between the LRM and humans, holding item identity exactly constant; cluster-robust SE by item. † Descriptive cell only, not read as inferential evidence: gpt-oss-20b is parser-limited ($n = 119$, with parseable final outputs only); Qwen3-235B-Thinking is identified from four wrong trials.
### S2\. Mundlak within–between decomposition on H\-ARC

The within-item slope $\beta_W$ is the coefficient on $c$; the between-item slope is $\beta_W + \beta_M$, where $\beta_M$ is the coefficient on $\overline{c}_i$ and provides the Hausman-style test that within- and between-item slopes differ.

Table S2: Mundlak within–between decomposition on H-ARC (*The within-agent allocation gap on H-ARC*). The combined-frame dissociation under Mundlak ($-0.79$) is larger in magnitude than under FE ($-0.66$; the primary item-FE estimand of *The within-agent allocation gap on H-ARC*).
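To make the decomposition concrete, the following is a minimal sketch on synthetic data, not the authors' analysis code. It assumes $c$ is trial-level correctness and $\overline{c}_i$ is its item mean, and it relies on the standard OLS identity that regressing the outcome on $c$ and $\overline{c}_i$ gives the within-item slope as the coefficient on $c$ and (between slope minus within slope) as the coefficient on $\overline{c}_i$. The function names, the planted slopes, and the simulated data are all hypothetical.

```python
import random

def mundlak_slopes(items, correct, y):
    """Return (beta_W, beta_M) for pooled OLS of y on c and the item mean of c.

    Uses the exact identity: coefficient on c = within-item slope,
    coefficient on the item mean = between-item slope - within-item slope.
    """
    groups = {}
    for i, c, v in zip(items, correct, y):
        groups.setdefault(i, []).append((c, v))

    n_total = len(y)
    c_grand = sum(correct) / n_total
    y_grand = sum(y) / n_total

    sxy_w = sxx_w = sxy_b = sxx_b = 0.0
    for rows in groups.values():
        n_i = len(rows)
        c_bar = sum(c for c, _ in rows) / n_i
        y_bar = sum(v for _, v in rows) / n_i
        for c, v in rows:
            # Within-item deviations identify beta_W.
            sxy_w += (c - c_bar) * (v - y_bar)
            sxx_w += (c - c_bar) ** 2
        # Item means (weighted by cell size) identify the between slope.
        sxy_b += n_i * (c_bar - c_grand) * (y_bar - y_grand)
        sxx_b += n_i * (c_bar - c_grand) ** 2

    beta_w = sxy_w / sxx_w
    beta_between = sxy_b / sxx_b
    return beta_w, beta_between - beta_w

def simulate(n_items=200, n_trials=10, beta_w=-0.6, beta_m=1.0, seed=0):
    """Synthetic trials with planted (hypothetical) within and Mundlak slopes."""
    rng = random.Random(seed)
    items, correct, y = [], [], []
    for i in range(n_items):
        p = rng.uniform(0.2, 0.8)  # item-specific solve rate
        cs = [1 if rng.random() < p else 0 for _ in range(n_trials)]
        c_bar = sum(cs) / n_trials
        for c in cs:
            items.append(i)
            correct.append(c)
            y.append(2.0 + beta_w * c + beta_m * c_bar + rng.gauss(0, 0.3))
    return items, correct, y

items, correct, y = simulate()
beta_w_hat, beta_m_hat = mundlak_slopes(items, correct, y)
print(f"within slope beta_W        = {beta_w_hat:+.2f}")
print(f"Mundlak term beta_M        = {beta_m_hat:+.2f}")
print(f"between slope beta_W+beta_M = {beta_w_hat + beta_m_hat:+.2f}")
```

In this toy setup the within and between slopes have opposite signs, so a nonzero $\beta_M$ is exactly what flags that the two levels disagree.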
### S3\. Difficulty\-controlled within\-agent regression: per\-agent forest plot

The leave-one-out difficulty-controlled regression (Figure [S1](https://arxiv.org/html/2606.26502#S0.F1)) returns $\beta_{\mathrm{correct}} < 0$ at $p < .001$ for the four well-powered LRMs and is not reliably negative for the two underpowered cells (gpt-oss-20b, Qwen3-235B-Thinking; SI S6). Humans show $\beta_{\mathrm{correct}} = +0.09$ ($n = 4{,}091$, $p = .001$), opposite in sign to the LRMs.

![Refer to caption](https://arxiv.org/html/2606.26502v1/x6.png)

Figure S1: Difficulty-controlled within-agent regression on H-ARC (*The within-agent allocation gap on H-ARC*). Each row shows the within-agent correctness coefficient $\beta_{\mathrm{correct}}$ after partialling out leave-one-out ensemble item difficulty. Negative values mean that, holding item difficulty fixed at the level estimated from the other agents, the focal agent extends deliberation on its own wrong trials more than on its own right trials. Annotations show $n$ (analysable trials) and the $p$-value for the correctness term.
### S4\. Aggregated\-cell robustness for the H\-ARC dissociation

Because the public release does not retain participant identifiers (see Methods, *Hierarchical mixed-effects test on H-ARC*), the trial-level human $n = 4{,}091$ contains within-item pseudo-replication: roughly ten human trials per item, with no participant random intercept available. We re-estimate the combined item-FE dissociation regression after collapsing each (item $\times$ correct $\in \{0,1\}$) human cell to one row carrying the within-cell mean $\log t$ and the cell size as a regression weight. This brings the human contribution down from $4{,}091$ trial-level rows to $780$ item-level cells while preserving the within-item identification of the $\beta_{\mathrm{LRM}}$ interaction. LRM rows are left at the $1{,}637$ (model $\times$ item) granularity (one observation per cell already). Cluster-robust SE by item throughout.

The aggregated-cell estimates are essentially identical to the trial-level result (Table [S3](https://arxiv.org/html/2606.26502#S0.T3)). Weighted by cell size: $\beta_{\mathrm{LRM}} = -0.655$, $95\%$ CI $[-0.82, -0.49]$, $p < .001$. Unweighted cell means: $\beta_{\mathrm{LRM}} = -0.606$, $95\%$ CI $[-0.74, -0.47]$, $p < .001$. The trial-level estimate from the main text is $\beta_{\mathrm{LRM}} = -0.66$ (item-FE specification (iii)). The dissociation does not depend on within-item human pseudo-replication.

Table S3: Aggregated-cell robustness on H-ARC. Human trials are collapsed to (item $\times$ correctness) cells; LRM rows are left at one observation per (model $\times$ item). The combined item-FE dissociation regression is then re-estimated. Weighted by cell size or unweighted, the headline interaction coefficient is essentially unchanged from the trial-level estimate.
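The collapsing step described above can be sketched as follows. This is an illustrative sketch under assumed field names (`item`, `correct`, `log_t` are hypothetical, as are the seven toy trials), not the deposited analysis script: each (item, correctness) cell becomes one row carrying the within-cell mean $\log t$ and the cell size, which is then available as a regression weight.

```python
from collections import defaultdict

def collapse_to_cells(trials):
    """Collapse (item, correct, log_t) trials to one row per (item, correct) cell."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum of log_t, trial count]
    for item, correct, log_t in trials:
        cell = sums[(item, correct)]
        cell[0] += log_t
        cell[1] += 1
    return [
        {"item": item, "correct": correct, "mean_log_t": total / n, "weight": n}
        for (item, correct), (total, n) in sorted(sums.items())
    ]

# Hypothetical human trials: (item id, correct in {0, 1}, log response time).
trials = [
    ("a", 1, 2.0), ("a", 1, 3.0), ("a", 0, 1.0),
    ("b", 1, 4.0), ("b", 0, 2.0), ("b", 0, 3.0), ("b", 0, 4.0),
]
cells = collapse_to_cells(trials)
for row in cells:
    print(row)

# Sanity check: weighting cell means by cell size recovers the trial-level mean.
weighted_mean = sum(r["mean_log_t"] * r["weight"] for r in cells) / sum(r["weight"] for r in cells)
trial_mean = sum(t[2] for t in trials) / len(trials)
print(weighted_mean, trial_mean)
```

The weighted and unweighted variants reported in Table S3 differ only in whether the `weight` field is passed to the regression or ignored.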
### S5\. Toy\-model truncation curves: simulation details

The toy simulation in Figure [5](https://arxiv.org/html/2606.26502#S4.F5) (Discussion, *How to test this further*) models items the agent normally solves at full chain length $L_i = U_i + \text{padding}_i$, where $U_i$ is useful compute and $\text{padding}_i \geq 0$ is residual. Under the useful-search interpretation, $\text{padding}_i = 0$ and the agent succeeds at truncation budget $f$ iff $f$ lies past a commitment position drawn from $\text{Beta}(8,2)$ (mode $= 7/8 \approx 0.88$): the agent has typically committed to its answer late in the chain. Under the padding / length-on-uncertainty interpretation, padding fraction $p_i = \text{padding}_i / U_i$ is drawn from $0.2 + 1.8 \cdot \text{Beta}(2,2)$, and the agent succeeds iff $f \geq 1/(1+p_i)$, that is, iff the truncated chain still covers the useful fraction $U_i / L_i$. The two readings predict qualitatively different curves regardless of the specific distributional choices; the simulation is illustrative, not a fit.
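The two success rules can be simulated in a few lines. The sketch below is a hedged reconstruction from the description in this section, not the code behind Figure 5: it assumes the truncation budget $f$ is a fraction of the full chain length in $[0, 1]$ and that "lies past" means a strict inequality; the function name, grid of budgets, sample size, and seed are hypothetical choices.

```python
import random

def truncation_curves(n_items=20000, budgets=None, seed=0):
    """Success rate vs. truncation budget f under the two readings."""
    rng = random.Random(seed)
    if budgets is None:
        budgets = [k / 20 for k in range(21)]  # f = 0.00, 0.05, ..., 1.00

    # Useful-search reading: success iff f lies past a Beta(8, 2) commitment position.
    commit = [rng.betavariate(8, 2) for _ in range(n_items)]

    # Padding reading: p_i ~ 0.2 + 1.8 * Beta(2, 2); success iff f >= 1 / (1 + p_i).
    padding = [0.2 + 1.8 * rng.betavariate(2, 2) for _ in range(n_items)]
    threshold = [1.0 / (1.0 + p) for p in padding]

    useful = [sum(f > c for c in commit) / n_items for f in budgets]
    padded = [sum(f >= t for t in threshold) / n_items for f in budgets]
    return budgets, useful, padded

budgets, useful, padded = truncation_curves()
for f, u, p in zip(budgets, useful, padded):
    print(f"f={f:.2f}  useful-search={u:.3f}  padding={p:.3f}")
```

Under these assumptions the padding curve saturates well before the full budget, because $1/(1+p_i)$ never exceeds $1/1.2$, while the useful-search curve stays low until late in the chain.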

### S6\. LRM data provenance and trace counting

The released frame (de Varda et al., [2025](https://arxiv.org/html/2606.26502#bib.bib4)) provides, per item: the model’s prompt, its full output (reasoning trace plus final answer), the field `reasoning_token_length`, and supplementary fields. The deliberation measure $t$ is `reasoning_token_length` (intermediate-trace tokens before the final answer, in each model’s native tokenizer) for thinking LRMs and `total_output_tokens` (de Varda et al., [2025](https://arxiv.org/html/2606.26502#bib.bib4), harc.py) for the non-thinking DeepSeek-V3 control. We do not retokenize across models; between-LRM comparisons of raw $t$ are tokenizer-confounded, so between-LRM use of $t$ is restricted to the within-agent contrast. Final-answer correctness is scored against the released gold answer with the parser used by the de Varda et al. analysis script, augmented for grid-format H-ARC output by a regular-expression match against the `Correct_answer` string. Per-model attempted, parseable, correct, and wrong counts on each paradigm are listed in the OSF deposit (see *Data and Code Availability* in the main paper).

Two H-ARC model $\times$ paradigm cells warrant explicit caveats.

*gpt-oss-20b on H-ARC.* Only $119$ of $400$ H-ARC items have a parseable final grid for this model; on the remaining items the final-grid parser fails. Whether the parser failure rate is independent of correctness is not directly testable (a failed parse has no correctness label), and we cannot rule out selection on length or content. We therefore report this cell descriptively and not as inferential evidence.

*Qwen3-235B-Thinking on H-ARC.* Only $43$ of the $400$ H-ARC items appear in the released frame for this model (the de Varda et al. release matches LRM-side runs to the H-ARC item set on the prompt-template hash; the remaining items are absent for this model in the released subset). Of those $43$ items, only $4$ are wrong trials, so this cell is underpowered for any per-LRM inference and we treat it as descriptive.

### S7\. Statistical specification choices

Throughout the paper we treat the item fixed-effects estimate (identified purely from within-item variation, with cluster-robust SE by item) as the primary difficulty-controlled estimand for the agent-type interaction $\beta_{\mathrm{LRM}}$. The mixed-effects variants (item random intercept; crossed agent $+$ item random intercepts) and the cluster-robust OLS rows in Table [S4](https://arxiv.org/html/2606.26502#S0.T4) are reported as robustness checks. They differ in coefficient magnitude (the crossed random-effects refit, in particular, produces a more negative point estimate than the item random-intercept specification) because they place different weights on between-item information and impose different shrinkage on the agent-level main effect; we do not interpret these magnitude differences as substantively informative and we do not choose between them. The dissociation interaction is negative and well-bounded away from zero under every specification we ran.

The agent-label clustered specification is reported as a finite-cluster sensitivity check rather than a primary inferential test, because the number of agent-label clusters is small (humans plus four well-powered LRMs). The pooled human $+$ LRM regression has unequal cluster sizes ($n_{\text{human}} = 4{,}091$ vs. $n_{\text{LRM}} = 1{,}637$ on H-ARC); the imbalance does not by itself determine the sign of the within-item interaction, which is identified from variation across the agents attempting the same item rather than from main-effect comparisons of group means.

The released LRM data contain one observed outcome per (item $\times$ model) pair, so the within-item LRM correctness slope in the LRMs-pooled regression is identified by variation across LRMs on the same item, and the per-LRM combined regression is identified by variation across humans plus one LRM on the same item. Neither specification estimates a repeated stochastic within-item slope for a single model; a per-model stochastic estimate would require multiple LRM samples per item, which the public release does not provide.

### S8\. Multiple comparisons: implicit family of tests

We do not apply family-wise multiple-comparison correction to the $p$-values reported in the main text. The two headline dissociations are the H-ARC item-FE interaction ($p < .001$, with $p \approx 4 \times 10^{-17}$ in the deposited results) and the INTUIT replication ($p < .001$); both survive Bonferroni and Holm corrections by many orders of magnitude irrespective of the implicit family size. Several borderline secondary cells are flagged as descriptive throughout the main text and should not be interpreted as inferential evidence at $\alpha = .05$ once the implicit family of tests is acknowledged: the gpt-oss-20b and Qwen3-235B-Thinking H-ARC per-LRM dissociations, the INTUIT humans-only correctness slope ($p = .087$), and the harder-items human-engagement quartile ($p = .05$). The cross-paradigm rank-stability summary is descriptive (Spearman $\rho$ on $n = 6$ models per pair). The trace-content regressions in Table [S7](https://arxiv.org/html/2606.26502#S0.T7) report inferential $p$-values; the two bolded headline rows (H-ARC self-doubt density and Cortes $5$-gram repetition / type–token ratio) survive Bonferroni correction over the eight tested feature $\times$ paradigm cells, and we read them as inferential. The non-bolded rows are reported descriptively.
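For readers who want to see what the two corrections do to values of this size, here is a small sketch of Bonferroni and Holm step-down decisions. The four-member family is an assumption made for illustration only (the paper leaves the family implicit), and $10^{-3}$ is used as a stand-in upper bound for the INTUIT replication, whose exact $p$-value is not given in this section.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i iff p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

# Hypothetical family assembled from values quoted in this section.
family = {
    "H-ARC item-FE interaction": 4e-17,
    "INTUIT replication (upper bound, assumed)": 1e-3,
    "INTUIT humans-only slope": 0.087,
    "harder-items engagement quartile": 0.05,
}
names = list(family)
p_values = [family[k] for k in names]
for name, b, h in zip(names, bonferroni(p_values), holm(p_values)):
    print(f"{name:45s} Bonferroni={b}  Holm={h}")
```

With this assumed family both procedures retain the two headline tests and drop the two borderline cells, matching the descriptive treatment described above.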

### S9\. H\-ARC dissociation across specifications

Numerical values for the five specifications visualised in Figure [2](https://arxiv.org/html/2606.26502#S3.F2)c are reproduced in Table [S4](https://arxiv.org/html/2606.26502#S0.T4); the bolded row is the primary item-fixed-effects estimand.

Table S4: H-ARC dissociation interaction across specifications. Centred correctness $\times$ LRM interaction. The bolded row is the primary difficulty-controlled estimand (within-item identification, cluster-robust SE by item). The four upper rows are mixed-effects and cluster-robust anchors that triangulate the same interaction without imposing strict within-item identification. The agent-clustered OLS row is a finite-cluster sensitivity check, because the number of agent-label clusters is small (humans plus the well-powered LRMs). All five specifications agree in sign.
### S10\. Cross\-paradigm item\-FE regressions and per\-LRM dissociation

Item fixed-effects regressions on the three non-saturated paradigms underpin the cross-paradigm summary in *Generalisation and the Cortes boundary* (Figure [3](https://arxiv.org/html/2606.26502#S3.F3)). Table [S5](https://arxiv.org/html/2606.26502#S0.T5) reports the three within-item regressions per paradigm (humans-only, LRMs-pooled, and the combined dissociation $\beta_{\mathrm{LRM}}$). The within-item LRM slope on Cortes is positive ($+0.31$), opposite to its sign on H-ARC and INTUIT, but the dissociation interaction $\beta_{\mathrm{LRM}}$ remains highly negative on every analysable paradigm. Arithmetic is undefined because all thinking LRMs solve all items.

Table S5: Cross-paradigm item-FE regressions. Three within-item regressions per paradigm, with cluster-robust SE by item. The within-item LRM slope on Cortes (bold) is positive, opposite to its sign on H-ARC and INTUIT, but the dissociation interaction $\beta_{\mathrm{LRM}}$ remains highly negative across all analysable paradigms.

The same combined dissociation regression refit on humans plus each single LRM at a time is reported in Table [S6](https://arxiv.org/html/2606.26502#S0.T6). The four well-powered H-ARC LRMs each produce a negative interaction at $p < .001$ in the per-LRM regression. The two underpowered cells (gpt-oss-20b at $n = 119$, parser-limited; Qwen3-235B-Thinking at $n = 43$ matched items with four wrong trials) also point negative on H-ARC but are reported descriptively. The same per-LRM regressions on INTUIT and Cortes preserve the rank order: Spearman $\rho = +0.89$ between H-ARC and Cortes ($p = .02$), $+0.77$ between H-ARC and INTUIT ($p = .07$), and $+0.77$ between INTUIT and Cortes ($p = .07$); $n = 6$ models per pair. Qwen-QwQ-32B is the strongest-dissociating model in every paradigm.

Table S6: Per-LRM dissociation $\beta_{\mathrm{LRM}}$ across paradigms. Combined regression refit on humans plus each single LRM, with item fixed effects and cluster-robust SE by item. For cross-paradigm rank-comparability the H-ARC column is computed on the items in common across all six LRMs (so the underpowered gpt-oss-20b and Qwen3-235B-Thinking H-ARC cells use the LRM’s attempted-item subset rather than the full $400$-item frame); the unrestricted per-LRM H-ARC dissociation values, essentially identical for the four well-powered LRMs and slightly more negative than the matched-items values for the two underpowered cells (gpt-oss-20b $-0.815$ vs. $-0.78$; Qwen3-235B-Thinking $-0.323$ vs. $-0.27$), are reported in Table [S1](https://arxiv.org/html/2606.26502#S0.T1). Bolded row is the strongest-dissociating model on each paradigm.
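The rank-stability summary rests on Spearman correlations over only six models, so the attainable values are coarse. The sketch below computes $\rho$ from the untied-rank formula $\rho = 1 - 6\sum d^2 / (n(n^2-1))$ on made-up coefficient vectors; the three vectors are hypothetical and are not the entries of Table S6.

```python
def ranks(values):
    """Ranks from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-LRM interaction coefficients for six models on three paradigms.
paradigm_a = [-1.2, -0.9, -0.8, -0.7, -0.5, -0.3]
paradigm_b = [-1.0, -1.1, -0.6, -0.7, -0.4, -0.2]  # two adjacent swaps vs. paradigm_a
paradigm_c = [-0.8, -0.9, -1.0, -0.7, -0.5, -0.3]  # top three reversed vs. paradigm_a

print(round(spearman_rho(paradigm_a, paradigm_b), 2))  # sum d^2 = 4
print(round(spearman_rho(paradigm_a, paradigm_c), 2))  # sum d^2 = 8
```

With $n = 6$ untied ranks, $\sum d^2 = 4$ gives $\rho \approx 0.886$ and $\sum d^2 = 8$ gives $\rho \approx 0.771$, which round to the $+0.89$ and $+0.77$ reported above; a difference of a single pair of swapped ranks is therefore enough to move between the two.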
### S11\. Trace\-content regressions, length\-controlled

For each LRM trial on H-ARC and Cortes (INTUIT traces are restricted in the public release) we compute four length-normalised features from the released `reasoning_trace`: self-doubt / hedge marker density per $1{,}000$ characters, self-correction marker density, word-level $5$-gram repetition rate, and type–token ratio. For each feature we fit $\text{feature} \sim \text{correct} + \log w + C(\text{agent}) + C(\text{item})$, where $w$ is trace length in words, with cluster-robust SE by item. Table [S7](https://arxiv.org/html/2606.26502#S0.T7) reports the eight resulting coefficients. The two bolded headline rows (H-ARC self-doubt density; Cortes $5$-gram repetition rate) underwrite the convergent surface-marker reading in *Inside the dissociation: human engagement, LRM length-on-uncertainty*; the type–token ratio rows are consistent with the same length-on-uncertainty pattern but reported descriptively.

Table S7: Trace-content regressions, length-controlled. For each feature, $\beta$ is the coefficient on correctness in $\text{feature} \sim \text{correct} + \log w + C(\text{agent}) + C(\text{item})$ with cluster-robust SE by item. A negative $\beta$ on the self-doubt / hedge density or repetition row means the feature is *higher* on wrong trials. Bolded rows are the headline content asymmetries on each paradigm. INTUIT traces are restricted in the public release.
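As a rough illustration of what such length-normalised features look like, the sketch below computes three of the four from a raw trace string. The marker list, the tokenisation, and the exact definition of repetition rate (one minus the share of distinct word $5$-grams) are assumptions made for this example; the section does not specify the authors' lexicon or formulas, and self-correction density would follow the same pattern with a second, equally hypothetical marker list.

```python
import re

# Hypothetical hedge / self-doubt lexicon; the paper's marker list is not given here.
HEDGE_MARKERS = {"wait", "maybe", "perhaps", "hmm"}

def trace_features(trace, n=5):
    """Length-normalised surface features of one reasoning trace."""
    words = re.findall(r"[a-z']+", trace.lower())
    hedges = sum(w in HEDGE_MARKERS for w in words)
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {
        # Marker count per 1,000 characters of trace.
        "hedge_per_1k_chars": 1000 * hedges / len(trace) if trace else 0.0,
        # Share of word n-grams that duplicate another n-gram in the trace.
        "ngram_repetition": 1 - len(set(ngrams)) / len(ngrams) if ngrams else 0.0,
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }

example = "wait maybe the answer is blue wait maybe the answer is blue"
features = trace_features(example)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Features of this kind would then enter the regression above as the dependent variable, with $\log w$ absorbing whatever dependence on trace length remains after normalisation.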
### S12\. Robustness checks and Cortes bootstrap intervals

The H-ARC allocation gap referenced in *Robustness* is supported by three additional checks (Figure [S2](https://arxiv.org/html/2606.26502#S0.F2)). The trial-level cross-item Spearman correlation between LRM reasoning-token length and human reaction time is positive across the well-powered LRMs and absent for the non-thinking V3 baseline (panel a). The LRM d-ratio remains $d > 0.5$ in nearly every model $\times$ population-accuracy quartile cell, including the easiest quartile (panel b). The human H-ARC d-ratio is $-0.10$ at the trial level and $+0.60$ at the item-level split: the latter recreates exactly the cross-item difficulty confound the within-agent contrast is designed to avoid (panel c). Cortes per-LRM bootstrap confidence intervals (Efron and Tibshirani, [1993](https://arxiv.org/html/2606.26502#bib.bib31)) on the within-agent d-ratio are wide, reflecting $5$–$8$ wrong trials per model, but exclude zero for all six thinking LRMs (Figure [S3](https://arxiv.org/html/2606.26502#S0.F3)).

![Refer to caption](https://arxiv.org/html/2606.26502v1/x7.png)

Figure S2: Robustness checks for the H-ARC allocation gap. (a) Cross-item Spearman correlation between LRM reasoning-token length and human reaction time on H-ARC; DeepSeek-V3 is the non-thinking baseline and shows no positive correlation. (b) LRM d-ratios by human-difficulty quartile; cells marked “NA” lack enough right or wrong trials in that cell to estimate a d-ratio, and should not be read as $d = 0$. (c) Human d-ratio under trial-level (primary) vs. item-level (confounded) operationalisation; the negative tick on the x-axis is shown explicitly so that the trial-level point is read as negative rather than as approximately zero.

![Refer to caption](https://arxiv.org/html/2606.26502v1/x8.png)

Figure S3: Cortes bootstrap uncertainty. Points are Cortes d-ratios per agent; horizontal bars are $95\%$ bootstrap confidence intervals from $10{,}000$ resamples. Per-LRM wrong-trial counts are annotated on the right of each interval. The dashed line marks the human Cortes d-ratio. The non-thinking DeepSeek-V3 baseline is included for reference.
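To show why intervals built on five to eight wrong trials are wide yet can still exclude zero, here is a minimal percentile-bootstrap sketch. It assumes the d-ratio is a wrong-minus-right Cohen's $d$ with pooled standard deviation and that resampling is done separately within the wrong and right groups; both are assumptions of this example, as are the synthetic data, the planted effect, and the reduced resample count ($2{,}000$ here versus the $10{,}000$ used for Figure S3).

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cohens_d(wrong, right):
    """Wrong-minus-right standardised difference with pooled SD."""
    n_w, n_r = len(wrong), len(right)
    pooled = ((n_w - 1) * sample_var(wrong) + (n_r - 1) * sample_var(right)) / (n_w + n_r - 2)
    return (mean(wrong) - mean(right)) / pooled ** 0.5

def bootstrap_ci(wrong, right, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling within each correctness group."""
    rng = random.Random(seed)
    stats = sorted(
        cohens_d(rng.choices(wrong, k=len(wrong)), rng.choices(right, k=len(right)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Synthetic log-token lengths: 6 wrong trials, 60 right trials, large planted gap.
data_rng = random.Random(1)
wrong = [data_rng.gauss(9.2, 0.4) for _ in range(6)]
right = [data_rng.gauss(8.0, 0.4) for _ in range(60)]

d_hat = cohens_d(wrong, right)
ci_lo, ci_hi = bootstrap_ci(wrong, right)
print(f"d = {d_hat:.2f}, 95% bootstrap CI = [{ci_lo:.2f}, {ci_hi:.2f}]")
```

The interval width here is driven almost entirely by the six wrong trials, which is the same constraint that makes the Cortes per-LRM intervals wide.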
