Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL 05/18/26, 04:00 AM Papers
Summary
This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, they find that reasoning-trained models exhibit distinct trajectory geometry—most clearly in code—indicating reasoning training changes how computation unfolds, not just how much is used.
arXiv:2605.15454v1 Announce Type: new
Cached at: 05/18/26, 06:31 AM
# Reasoning Models Don’t Just Think Longer, They Move Differently
Source: [https://arxiv.org/html/2605.15454](https://arxiv.org/html/2605.15454)
Anders Gjølbye (1, 2), Lars Kai Hansen (1), Sanmi Koyejo (2)
(1) Technical University of Denmark, (2) Stanford University
gjoelbye@cs.stanford.edu, lkai@dtu.dk, sanmi@cs.stanford.edu

###### Abstract

Reasoning\-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory\. We study this distinction through hidden\-state trajectories during chain\-of\-thought generation across competitive programming, mathematics, and Boolean satisfiability\. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty\-dependent comparisons are misleading without adjustment\. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied\. The clearest reasoning\-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning\-trained models than in matched instruction\-tuned baselines\. Corrected difficulty\-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability\. Prompt\-stage linear probes do not mirror the code\-domain separation, and behavioral annotations show that stronger corrected coupling co\-occurs with strategy shifts and uncertainty monitoring\. Together, these findings establish length correction as a prerequisite for generation\-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain\.

## 1 Introduction

Reasoning-trained LLMs often spend more test-time compute on harder problems, producing substantially longer chains of thought and sometimes thousands of unnecessary tokens on easy ones (Chen et al., [2025](https://arxiv.org/html/2605.15454#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2605.15454#bib.bib4); Snell et al., [2024](https://arxiv.org/html/2605.15454#bib.bib5)). Longer traces, however, do not reveal whether a model is merely computing for more steps or following a different internal path. Output length alone cannot distinguish these possibilities: a model may extend the same process for longer, or its hidden-state trajectory may change systematically with problem difficulty.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x1.png)

Figure 1: Hidden-state trajectory geometry during chain-of-thought generation. Left, autoregressive hidden-state trajectories are extracted from matched reasoning and non-reasoning models on the same problem. Center, raw trajectory geometry is dominated by generation length: longer trajectories, which are more common on harder problems, appear mechanically less direct regardless of model type. Right, Codeforces illustrates the main reasoning-baseline contrast: raw directness-difficulty correlations are negative across models, while length-adjusted correlations separate reasoning models from matched baselines. Full cross-domain results for code, math, and SAT are reported in Figure [2](https://arxiv.org/html/2605.15454#S4.F2).

This distinction matters for interpreting reasoning training. If reasoning models differ from their baselines mainly by allocating more test-time compute, then recent gains may largely reflect better control over computation amount. If difficulty remains coupled to trajectory shape after accounting for length, then reasoning training may also be associated with changes in how computation unfolds during generation. Existing work has mostly studied this issue through outputs, test-time compute allocation, and failure modes such as overthinking or underthinking (Chen et al., [2025](https://arxiv.org/html/2605.15454#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2605.15454#bib.bib4); Snell et al., [2024](https://arxiv.org/html/2605.15454#bib.bib5)). We instead ask whether difficulty-dependent differences are visible in hidden-state trajectories during chain-of-thought generation.

The central complication is that the most intuitive geometric signal is also the easiest to misread. Longer paths are mechanically less direct, a phenomenon well characterized in movement ecology (Benhamou, [2004](https://arxiv.org/html/2605.15454#bib.bib13); Codling et al., [2008](https://arxiv.org/html/2605.15454#bib.bib14)) but largely unaddressed in generation-time analyses of LLM representations. Since harder problems also elicit longer generations, raw geometry can make hard problems appear less organized simply because their trajectories contain more steps. To avoid this confound, we calibrate item difficulty with an item response theory (IRT) model, extract hidden-state trajectories over generated solution segments, and measure within-model difficulty-geometry coupling after residualizing trajectory statistics on generation length. We then compare this corrected coupling across matched reasoning and instruction-tuned baseline models.

This length\-corrected view changes the qualitative result\. Before correction, harder problems tend to have less direct trajectories\. After residualizing on generation length, the relationship reverses across competitive programming, mathematics, and Boolean satisfiability: harder problems elicit more direct corrected trajectories\. The reversal is therefore not only a code\-domain effect, but a cross\-domain warning that raw generation\-time geometry must be interpreted with explicit length correction\.

Corrected geometry also separates model classes, but unevenly across domains. In competitive programming, all six reasoning-trained models show positive corrected directness-difficulty coupling, while matched baselines remain near zero (reasoning median $\rho_{\perp}^{D}=+0.41$, baseline $-0.06$). In mathematics, the separation is weaker and more heterogeneous ($+0.05$ vs. $-0.07$). In Boolean satisfiability, both reasoning and baseline models show positive corrected coupling (medians $+0.27$ and $+0.23$), indicating that corrected difficulty-geometry coupling can also emerge in instruction-tuned baselines. The clearest reasoning-specific contrast is therefore in competitive programming.

Two additional analyses help interpret this geometric signal\. First, prompt\-stage linear probes do not show the same reasoning\-baseline separation as corrected geometry in the code domain, suggesting that the effect is not simply stronger linear access to difficulty before generation\. Second, sentence\-level behavioral annotations from independent LLM judges show that stronger geometric coupling co\-occurs with strategy shifting and uncertainty monitoring\. These behavioral analyses are descriptive rather than causal, since the annotations and geometry are measured from the same generated traces\.

Our contributions are: (i) identifying generation length as a structural confound in generation-time trajectory geometry; (ii) introducing a length-corrected analysis showing that difficulty remains coupled to corrected geometry across competitive programming, mathematics, and Boolean satisfiability; (iii) showing that reasoning-specific separation is domain-dependent, clearest in competitive programming, while corrected difficulty-geometry coupling persists more weakly elsewhere; (iv) relating the signal to probes and observable reasoning behaviors: linear difficulty decodability does not track the code-domain separation, while stronger corrected coupling co-occurs with strategy shifting and uncertainty monitoring; and (v) a large-scale trajectory archive, to be released publicly, pairing generated chain-of-thought traces with sampled generation-time hidden-state trajectories for the matched reasoning and instruction-tuned models.

## 2 Related Work

This paper lies at the intersection of three lines of work: geometric analyses of internal representations, studies of difficulty in LLMs, and work on difficulty\-dependent reasoning behavior at inference time\.

#### Trajectory geometry in LLMs\.

Recent work has used trajectory geometry to study the structure of computation in LLM representations. Hosseini and Fedorenko ([2023](https://arxiv.org/html/2605.15454#bib.bib12)) showed that LLMs progressively straighten sentence-level trajectories across layers, paralleling temporal straightening in biological neural systems. Zhou et al. ([2026](https://arxiv.org/html/2605.15454#bib.bib10)) formalized reasoning as geometric flows in representation space, showing that curvature captures logical structure under carrier-invariant designs. Damirchi et al. ([2026](https://arxiv.org/html/2605.15454#bib.bib11)) found that full displacement vectors across layers outperform scalar kinematic descriptors for predicting reasoning validity. These studies establish geometry as a useful lens on internal computation, but they focus on *fixed-depth* trajectories across layers for a single token. Our setting is different: we study generation-time trajectories across tokens at a fixed layer, where path length varies systematically across examples. This makes generation length a central methodological concern, since geometric metrics can change mechanically with trajectory length. Sun et al. ([2026](https://arxiv.org/html/2605.15454#bib.bib43)) characterize reasoning as trajectories through step-specific representation subspaces, showing that correct and incorrect solutions diverge at late steps and that trajectory-based steering can redirect reasoning. Our question is complementary: we study token-time trajectories at a fixed layer rather than layer-indexed step representations, and ask whether problem difficulty modulates trajectory geometry after removing the mechanical effects of generation length, a confound not addressed in step-indexed analyses.

#### Difficulty in LLMs\.

A separate line of work studies how LLMs encode or measure problem difficulty. Linear probes can decode difficulty from hidden states with high accuracy (Lugoloobi and Russell, [2025](https://arxiv.org/html/2605.15454#bib.bib7)). IRT has also been adopted for LLM benchmarking and evaluation (Polo et al., [2024](https://arxiv.org/html/2605.15454#bib.bib25); Zhou et al., [2025](https://arxiv.org/html/2605.15454#bib.bib26); Xu et al., [2025](https://arxiv.org/html/2605.15454#bib.bib27)). Zhu et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib8)) estimated model-perceived difficulty from hidden representations via a value-function framework, while Lee et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib9)) identified attention heads with distinct activation patterns for easy versus hard problems. These works show that difficulty is represented in model internals and can be measured continuously. Our goal, however, is not to show that difficulty is encoded, but to use a continuous difficulty variable to study how internal computation changes across problems.

#### Difficulty\-dependent reasoning behavior\.

Work on overthinking, underthinking, and inference-time compute has shown that reasoning models allocate computation differently across easy and hard problems. Snell et al. ([2024](https://arxiv.org/html/2605.15454#bib.bib5)) showed that optimal compute allocation depends on difficulty. Chen et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib3)) documented overthinking on easy problems, while Wang et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib4)) identified underthinking on hard problems; Su et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib6)) showed that both behaviors can coexist. Huang et al. ([2025](https://arxiv.org/html/2605.15454#bib.bib29)) linked overthinking to a low-dimensional activation manifold and proposed steering-based mitigation. These works primarily characterize difficulty-dependent adaptation through outputs or pathological regimes. Our paper asks the complementary internal question: whether reasoning training changes the geometry of the generation-time trajectory itself, across the full difficulty continuum and after controlling for response length.

Taken together, these literatures motivate geometry, difficulty, and inference\-time adaptation as relevant lenses, but leave open whether reasoning training changes generation\-time internal dynamics as a function of problem difficulty once the strong response\-length confound is removed\.

## 3 Experimental Setup

We use a matched design to separate four quantities that are otherwise entangled: problem difficulty, generation length, model class, and trajectory geometry\. We define comparable item sets across three domains, calibrate a continuous difficulty scale within each domain, compare matched reasoning and instruction\-tuned model pairs on the same items, and extract hidden\-state trajectories from generated solution segments\.

Datasets. We evaluate on 500 Easy2Hard-Bench competitive-programming problems (Ding et al., [2024](https://arxiv.org/html/2605.15454#bib.bib18)), 500 MATH problems (Hendrycks et al., [2021](https://arxiv.org/html/2605.15454#bib.bib19)), and 500 SATBench problems (Wei et al., [2025](https://arxiv.org/html/2605.15454#bib.bib45)). SATBench items are stratified into five clause-count bins spanning 4–45 clauses and are approximately balanced between satisfiable and unsatisfiable instances within each bin. This yields 1,500 items across competitive programming, mathematics, and Boolean satisfiability.

Difficulty calibration. Native difficulty labels are platform-specific (Codeforces Glicko-2 ratings), coarsely ordinal (MATH levels 1–5), or structural (SAT clause counts; SATBench clause count is the dominant proxy for instance hardness in the synthetic regime studied here). To obtain a continuous latent difficulty scale within each domain, we fit a Rasch model (Rasch, [1960](https://arxiv.org/html/2605.15454#bib.bib17)) with a binomial likelihood over repeated runs:

$$k_{ij} \sim \mathrm{Binomial}\bigl(n_{ij},\; \sigma(\theta_{j} - b_{i})\bigr), \tag{1}$$

where $k_{ij}$ is the number of correct completions by model $j$ on item $i$, and $b_{i}$ is item difficulty. IRT is calibrated separately per domain from 32 models and validated against external labels: Spearman $\rho=0.55$ with Codeforces ratings, $\rho=0.43$ with MATH levels, and $\rho=0.56$ ($r=0.58$) with SAT clause counts. We use $b_{i}$ as the continuous independent variable throughout. Appendix [A.6](https://arxiv.org/html/2605.15454#A1.SS6) reports calibration diagnostics, external-label agreement, 1PL–2PL comparisons, and leave-one-out recalibration checks.
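As an illustration of Eq. 1, the following minimal sketch evaluates the binomial Rasch log-likelihood for given parameters. It is not the calibration code used in the paper: the function name and the nested-list data layout are hypothetical, `n[i][j]` is assumed to be the number of runs of model `j` on item `i`, and `theta[j]` plays the usual Rasch role of model ability. The constant binomial coefficient is dropped because it does not depend on the parameters.

```python
import math


def rasch_log_likelihood(k, n, theta, b):
    """Binomial Rasch log-likelihood, up to the constant binomial coefficient.

    k[i][j]: correct completions of model j on item i
    n[i][j]: number of runs of model j on item i (assumed layout)
    theta[j]: model ability, b[i]: item difficulty
    """
    total = 0.0
    for i, b_i in enumerate(b):
        for j, theta_j in enumerate(theta):
            z = theta_j - b_i
            # log sigma(z), written in a numerically stable form
            if z >= 0:
                log_p = -math.log1p(math.exp(-z))
            else:
                log_p = z - math.log1p(math.exp(z))
            # log(1 - sigma(z)) = log sigma(z) - z
            log_q = log_p - z
            total += k[i][j] * log_p + (n[i][j] - k[i][j]) * log_q
    return total
```

Maximizing this quantity over `theta` and `b` with any gradient-based optimizer would give one way to obtain item difficulties; the optimizer and any identifiability constraint (for example, centering `b`) are left open here.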

Matched model pairs. The core analysis uses six matched reasoning-baseline comparisons across Qwen, Llama, and Phi families, with three reasoning-training recipes: R1 distillation, SFT+RL, and o3-mini distillation. These six comparisons contain five unique baseline models because Qwen2.5-32B-Instruct serves as the shared baseline for both R1-Distill-Qwen-32B and QwQ-32B. Pair-level counts use six matched comparisons; unique-baseline counts use five baseline models. We state which convention is used wherever counts are reported. The 32B shared-base comparison is especially clean because R1-Distill-Qwen-32B and QwQ-32B differ in reasoning-training recipe while sharing the same instruction-tuned baseline.

Table 1: Matched model pairs used in the main comparison.

Trajectory extraction overview. We extract hidden states at five evenly spaced layers for five runs per problem per model, with 30 runs for R1-Distill-Qwen-7B in stability analyses. Unless otherwise stated, main figures report the median statistic across these five prespecified sampled layers; layer-specific results are reported in Appendix [D.2](https://arxiv.org/html/2605.15454#A4.SS2). We distinguish three representational levels: prompt-stage representations measured at the final prompt token before generation, generation-time trajectories measured over the generated solution segment, and output-level behavior measured from generated traces and correctness outcomes. Correctness is evaluated by code execution for competitive programming, symbolic matching of boxed answers for MATH, and pattern matching of `SATISFIABLE`/`UNSATISFIABLE` markers against the ground-truth label for SAT.

Trajectory archive. We will make the full sampled-trajectory archive used in this study publicly available in a subsequent release (footnote 1: the repository will be released together with the archive). The approximately 3 TB archive pairs generated chain-of-thought traces with sampled generation-time hidden-state trajectories for the matched reasoning and instruction-tuned models, indexed by item, model, run, layer, and token position. To our knowledge, this is the first large-scale public resource pairing generated reasoning traces with generation-time hidden-state trajectories across matched reasoning and instruction-tuned models.

## 4 Trajectory Geometry and Length Correction

Generated solution segments. For problem $i$, model $m$, run $r$, layer $\ell$, and generation step $t$, let $\mathbf{h}_{imr,t}^{(\ell)} \in \mathbb{R}^{d}$ be the post-block residual-stream output at the final generated-token position. We restrict analysis to the generated solution segment. For reasoning-trained models with explicit thinking delimiters, this segment is the delimited thinking block. For instruction-tuned baselines, no native thinking delimiter exists, so we define the generated solution segment as the pre-answer text before the first detected answer boundary. In code, the boundary is the first code fence; in math, the first final-answer marker such as `\boxed{}`; in SAT, the first `SATISFIABLE`/`UNSATISFIABLE` marker; XML answer tags are fallback markers. If no boundary is detected, the full baseline output is treated as the pre-answer segment. For tagged reasoning models, malformed or missing thinking delimiters are treated as answer-only and therefore produce zero-length reasoning segments; empirically, tagged-but-empty cases were not observed in any reasoning model/domain setting. Because reasoning models often provide explicit thinking delimiters whereas baselines require inferred answer boundaries, Appendix [D.1](https://arxiv.org/html/2605.15454#A4.SS1) reports boundary-detection rates, fallback rates, and boundary-policy sensitivity checks.
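The baseline boundary policy described above can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the `<answer>` tag used as the XML fallback, and the function name are all hypothetical stand-ins, since the exact detection rules are not given in this section.

```python
import re

# Hypothetical marker patterns; the paper's exact detection rules are not given here.
BOUNDARY_PATTERNS = {
    "code": re.compile(r"`{3}"),                    # first code fence
    "math": re.compile(r"\\boxed\{"),               # first final-answer marker
    "sat": re.compile(r"\b(?:UN)?SATISFIABLE\b"),   # first SAT verdict marker
}
# Assumed stand-in for the XML answer-tag fallback.
FALLBACK = re.compile(r"<answer>")


def pre_answer_segment(output: str, domain: str) -> str:
    """Return the text before the first detected answer boundary.

    The domain-specific marker takes precedence over the fallback tag.
    If no boundary is found, the full output is returned, matching the
    stated policy for instruction-tuned baselines.
    """
    match = BOUNDARY_PATTERNS[domain].search(output) or FALLBACK.search(output)
    return output[: match.start()] if match else output
```

Giving the domain marker precedence over the fallback is one reading of "fallback markers"; taking whichever marker appears earliest would be an equally plausible variant.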

We use $N_{imr}$ for raw generated solution-segment token length and $T_{imr}$ for sampled trajectory length after stride-based hidden-state sampling. All main trajectory analyses use stride 10. Runs with too few sampled states for a statistic are excluded from that statistic, with exclusion rates reported alongside the segmentation diagnostics. Curvature-based statistics require at least three sampled states.

Directness. For trajectory $(\mathbf{h}_{imr,0}^{(\ell)}, \dots, \mathbf{h}_{imr,T_{imr}}^{(\ell)})$, define path length and net displacement:

$$L_{imr}^{(\ell)} := \sum_{t=1}^{T_{imr}} \left\| \mathbf{h}_{imr,t}^{(\ell)} - \mathbf{h}_{imr,t-1}^{(\ell)} \right\|_{2}, \quad \Delta_{imr}^{(\ell)} := \left\| \mathbf{h}_{imr,T_{imr}}^{(\ell)} - \mathbf{h}_{imr,0}^{(\ell)} \right\|_{2}.$$

Directness is

$$D_{imr}^{(\ell)} = \Delta_{imr}^{(\ell)} / L_{imr}^{(\ell)} \in [0,1].$$

Directness is our primary interpretable statistic: it measures endpoint efficiency relative to the path actually taken.
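The directness definition can be written out as a short sketch. The function name, the plain-list representation of hidden states, and the handling of degenerate trajectories (returning `None`) are assumptions made for illustration, not details taken from the paper; a real pipeline would more likely operate on arrays.

```python
import math


def directness(states, stride=1):
    """Net displacement divided by path length for one trajectory.

    states: sequence of hidden-state vectors (each a sequence of floats),
    ordered by generation step. Returns None when the ratio is undefined.
    """
    sampled = states[::stride]  # stride-based sampling of hidden states
    if len(sampled) < 2:
        return None  # too few sampled states
    # Path length: sum of Euclidean distances between consecutive states
    path_length = sum(math.dist(a, b) for a, b in zip(sampled, sampled[1:]))
    if path_length == 0.0:
        return None  # assumption: a trajectory with no movement is excluded
    # Net displacement: distance between first and last sampled states
    return math.dist(sampled[0], sampled[-1]) / path_length
```

A straight path gives a value of 1 and a closed loop gives 0. For an unbiased random walk the expected net displacement grows roughly like the square root of the number of steps while path length grows linearly, which is the mechanical length dependence that the correction in this section is meant to remove.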

Curvature variability. Let $\kappa_{imr,t}^{(\ell)}$ be Menger curvature over consecutive triples. We define curvature variability as

$$V_{imr}^{(\ell)} := \operatorname{sd}\bigl(\kappa_{imr,1}^{(\ell)}, \dots, \kappa_{imr,T_{imr}-1}^{(\ell)}\bigr).$$

Curvature variability is a robustness-oriented local descriptor. Whereas directness summarizes endpoint efficiency, curvature variability measures heterogeneity in local bending and is less tied to a single net displacement. Negative $\rho_{\perp}^{V}$ indicates less heterogeneous local bending after length correction, not necessarily less total turning. As auxiliary checks, we also analyze two intrinsic-dimensionality metrics of the same trajectories, TwoNN and PCA90, using the same raw-versus-corrected correlation framework. Full setup and interpretation are reported in Appendix [C.2](https://arxiv.org/html/2605.15454#A3.SS2).
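A minimal sketch of curvature variability follows, using the standard Menger formula (four times the triangle area divided by the product of the three side lengths), which works in any dimension because it depends only on pairwise distances. Several choices here are assumptions rather than details from the paper: the function names, treating coincident points as zero curvature, and using the population standard deviation.

```python
import math
import statistics


def menger_curvature(x, y, z):
    """Menger curvature of three points: 4 * area / (product of side lengths)."""
    a, b, c = math.dist(x, y), math.dist(y, z), math.dist(x, z)
    if a == 0.0 or b == 0.0 or c == 0.0:
        return 0.0  # assumption: coincident points count as zero curvature
    s = 0.5 * (a + b + c)
    # Heron's formula; clamp at zero to absorb rounding on near-collinear triples
    area_sq = max(s * (s - a) * (s - b) * (s - c), 0.0)
    return 4.0 * math.sqrt(area_sq) / (a * b * c)


def curvature_variability(states):
    """Standard deviation of Menger curvature over consecutive triples."""
    if len(states) < 3:
        return None  # curvature needs at least three sampled states
    kappas = [
        menger_curvature(states[t - 1], states[t], states[t + 1])
        for t in range(1, len(states) - 1)
    ]
    # Assumption: population sd, so a single triple yields 0 rather than an error
    return statistics.pstdev(kappas)
```

Points sampled evenly on a circle have constant curvature, so their variability is close to zero even though the path turns continuously. This matches the distinction drawn above between heterogeneous bending and total turning.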

Length-residualized difficulty-geometry coupling. Within each domain, model, and sampled layer, we average over runs to obtain $\bar{D}_{im}^{(\ell)}$ and $\bar{V}_{im}^{(\ell)}$ for each item. Let $b_{i}$ be the domain-specific IRT difficulty and $N_{im}$ the mean solution-segment token length. We fit a length-only regression separately for each model-layer pair:

$$\bar{D}_{im}^{(\ell)} = \beta_{0m}^{(\ell)} + \beta_{1m}^{(\ell)} \log N_{im} + \varepsilon_{im}^{(\ell)}. \tag{2}$$

Define the residualized component $D_{\perp,im}^{(\ell)} := \bar{D}_{im}^{(\ell)} - \hat{D}_{\parallel,im}^{(\ell)}$, where $\hat{D}_{\parallel,im}^{(\ell)} = \hat{\beta}_{0m}^{(\ell)} + \hat{\beta}_{1m}^{(\ell)} \log N_{im}$ is the OLS-fit length component from Eq. [2](https://arxiv.org/html/2605.15454#S4.E2). Our primary estimand is

$$\rho_{\perp,m}^{D,(\ell)} := \rho_{S}\!\left(b_{i}, D_{\perp,im}^{(\ell)}\right). \tag{3}$$

The resulting statistic $\rho_{\perp}^{D}$ measures difficulty-geometry coupling during generation after removing the fitted length component. We apply the same residualization procedure to curvature variability to obtain $\rho_{\perp}^{V}$.
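The residualization in Eqs. 2 and 3 can be sketched for a single model-layer pair as below. This is an illustrative implementation in plain Python, not the authors' code: all function names are hypothetical, inputs are per-item lists already averaged over runs, and Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for idx in order[pos : end + 1]:
            out[idx] = 0.5 * (pos + end) + 1.0
        pos = end + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def length_residuals(stat, n_tokens):
    """Residuals of the OLS fit stat ~ b0 + b1 * log(N), as in Eq. 2."""
    log_n = [math.log(n) for n in n_tokens]
    mx, my = sum(log_n) / len(log_n), sum(stat) / len(stat)
    cov = sum((x - mx) * (y - my) for x, y in zip(log_n, stat))
    var = sum((x - mx) ** 2 for x in log_n)
    b1 = cov / var
    b0 = my - b1 * mx
    return [y - (b0 + b1 * x) for x, y in zip(log_n, stat)]


def corrected_coupling(difficulty, stat, n_tokens):
    """Spearman correlation of difficulty with the length-residualized statistic (Eq. 3)."""
    return pearson(ranks(difficulty), ranks(length_residuals(stat, n_tokens)))
```

Calling `corrected_coupling(b, mean_directness, mean_length)` with per-item lists would give the corrected directness coupling for one model and layer; passing mean curvature variability instead gives the curvature analogue.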

Interpretation depends on separating geometry from length, so Appendix [C.1](https://arxiv.org/html/2605.15454#A3.SS1) reports alternative residualizations and length-matched analyses, including $D \sim N^{-1/2}$, $\log D \sim \log N$, length-binned matching, and $T$-based variants using sampled trajectory length rather than raw token length. These checks address the functional form of the length correction; segmentation and layer sensitivity are reported separately in Appendices [D.1](https://arxiv.org/html/2605.15454#A4.SS1) and [D.2](https://arxiv.org/html/2605.15454#A4.SS2).

Prompt-stage difficulty decodability. To assess whether generation-stage coupling is mirrored by stronger linear difficulty information before generation, we extract the hidden state at the final prompt token and train Ridge probes to predict IRT difficulty. For each domain and model-layer pair, probes are trained and evaluated on held-out item splits using the same targets $b_{i}$; the resulting cross-validated prediction score is used as the prompt-stage decodability measure. We average over runs to obtain one row per item before training, then use 5-fold cross-validation by item. Prompt-stage decodability and generation-stage geometric coupling are different estimands: prompt probes measure linear accessibility of difficulty information before generation, while $\rho_{\perp}^{D}$ measures how difficulty is coupled to trajectory shape during generation. Their comparison is diagnostic rather than causal.
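A prompt-stage probe of this kind might look like the following NumPy sketch. It uses closed-form ridge regression with 5-fold cross-validation over items and reports a pooled out-of-fold R². The regularization strength, the random seed, the pooled-R² scoring, and the function name are illustrative assumptions; the paper's exact probe settings are not stated in this section.

```python
import numpy as np


def ridge_cv_r2(X, y, alpha=1.0, n_folds=5, seed=0):
    """Cross-validated R^2 of a ridge probe, splitting by item (row).

    X: (n_items, d) prompt-stage hidden states, one row per item
    y: (n_items,) IRT difficulties
    alpha and seed are illustrative defaults, not values from the paper.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    preds = np.empty_like(y)
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        # Center with training statistics so the intercept is not penalized
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xc = X[train] - x_mean
        gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
        w = np.linalg.solve(gram, Xc.T @ (y[train] - y_mean))
        preds[test] = (X[test] - x_mean) @ w + y_mean
    # Pooled out-of-fold R^2 (assumed scoring convention)
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because each item contributes a single row after averaging over runs, splitting rows into folds is equivalent to splitting by item, so no item appears in both the training and held-out sets.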

![Refer to caption](https://arxiv.org/html/2605.15454v1/x2.png)

Figure 2: Length correction reveals a sign reversal across all three domains. Hollow circles show raw Spearman correlations with IRT difficulty; filled squares show length-corrected correlations $\rho_{\perp}$ after residualizing on $\log N$. Panels (a–c) show directness ($\rho_{\perp}^{D}$), and panels (d–f) show curvature variability ($\rho_{\perp}^{V}$). The sign reversal is cross-domain; the reasoning-baseline separation is strongest on Codeforces and attenuated on SAT.

## 5 Results

Generation length qualitatively changes the interpretation of trajectory geometry\. In raw trajectories, harder problems appear less direct because they elicit longer generations, and longer sampled paths are mechanically less direct\. After residualizing trajectory statistics on length, this relationship reverses across Codeforces, MATH, and SAT: harder problems have more direct corrected trajectories\. The model\-class effect is more specific\. Corrected geometry separates reasoning models from matched baselines most clearly on Codeforces, weakly on MATH, and only modestly on SAT, where instruction\-tuned baselines also show positive corrected coupling\.

### 5.1 Length Correction Reveals a Cross-Domain Sign Reversal

Figure [2](https://arxiv.org/html/2605.15454#S4.F2) shows the raw and length-corrected directness-difficulty correlations. On Codeforces, all six reasoning models move from strongly negative raw coupling to positive corrected $\rho_{\perp}^{D}$ (median $-0.73 \rightarrow +0.41$), while matched baselines remain near zero after correction (median $-0.06$). On MATH, the same pattern is weaker: corrected medians are $+0.05$ for reasoning models and $-0.07$ for baselines. On SAT, the reversal persists but is not reasoning-specific, with positive corrected medians for both reasoning models and baselines ($+0.27$ and $+0.23$). Per-model 95% bootstrap CIs are reported in Appendix [C.1](https://arxiv.org/html/2605.15454#A3.SS1). Thus, SAT extends the length-correction result beyond code while bounding the reasoning-specific interpretation.

Codeforces provides the clearest controlled contrast\. R1\-Distill\-Qwen\-32B and QwQ\-32B share Qwen2\.5\-32B\-Instruct as their baseline but differ in reasoning\-training recipe; both reasoning models show positive corrected directness coupling, while the shared baseline remains near zero\. This within\-base comparison supports the code\-domain separation without relying only on family\-level differences\.

Curvature variability gives a complementary signal. On Codeforces, reasoning models shift to strongly negative corrected $\rho_{\perp}^{V}$ (median $-0.50$), while baselines remain near zero, consistent with less heterogeneous local bending on harder code problems after length correction. MATH shows small effects for both groups, and SAT is intermediate (reasoning median $-0.13$, baseline median $-0.07$). Directness is the more interpretable statistic; curvature variability is the more stable robustness-oriented complement. Appendix [C.2](https://arxiv.org/html/2605.15454#A3.SS2) shows that TwoNN and PCA90 are also length-confounded, but their corrected patterns are weaker and less aligned with the reasoning-baseline contrast.

### 5.2 Geometry Gaps Are Not Mirrored by Linear Difficulty Probes

Figure [3](https://arxiv.org/html/2605.15454#S5.F3) compares corrected geometry gaps with linear difficulty decodability gaps. In code, $\Delta\rho_{\perp}^{D}$ is uniformly positive across matched pairs, whereas $\Delta R^{2}_{\mathrm{prompt}}$ remains near zero and changes sign. The same code pairs also have negative $\Delta R^{2}_{\mathrm{gen}}$, so the corrected geometric separation is not accompanied by stronger linear difficulty decoding during generation. These results do not rule out nonlinear difficulty representations or differences in how difficulty information is used; they show only that the geometry gap is not a restatement of stronger linear decodability.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x3.png)

Figure 3: Corrected geometry gaps are not mirrored by stronger linear difficulty decodability. Each point or row is one matched reasoning-baseline pair in one domain, for 18 pair-domain records. All quantities are reasoning minus baseline. Panel (a) compares prompt-stage linear decodability gaps $\Delta R^{2}_{\mathrm{prompt}}$ with corrected geometric gaps $\Delta\rho_{\perp}^{D}$ on a common signed axis. Panel (b) plots $\Delta R^{2}_{\mathrm{prompt}}$ against $\Delta\rho_{\perp}^{D}$; code pairs have large positive geometry gaps but near-zero prompt-probe gaps. Panel (c) plots $\Delta R^{2}_{\mathrm{prompt}}$ against generation-stage probe gaps $\Delta R^{2}_{\mathrm{gen}}$ from the peak layer $\times$ position heatmap. Linear probing does not show a corresponding reasoning-model advantage, while the corrected geometry gap is largest on Codeforces and smaller on MATH and SAT.

Temporal-prefix analyses further show that the code-domain coupling is already visible by the first 10% of the generated solution segment and is then maintained. In MATH, the signal is weaker and more heterogeneous, and when present it appears to build more gradually. These analyses locate when the corrected geometry signal appears, but do not identify a causal mechanism.

### 5.3 Robustness and Scope

The Codeforces reasoning-baseline separation persists across several checks, with metric-dependent caveats. The $\rho_{\perp}^{D}$ gap remains under the four length-correction families in Appendix [C.1](https://arxiv.org/html/2605.15454#A3.SS1), although its magnitude varies; the primary $\log N$ correction agrees closely with the $N^{-1/2}$ correction, while log-log and binned corrections are less stable for directness. Curvature variability is less intuitive but more consistent across correction choices, and therefore provides a useful complement. Residual diagnostics show little remaining length dependence for directness, TwoNN, and PCA90 under the primary correction, but moderate residual length dependence for curvature variability.
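The length-correction step discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the corrected coupling is a Spearman correlation between item difficulty and the ordinary-least-squares residual of a trajectory statistic regressed on a transform of the number of sampled states $N$. All function names are hypothetical.

```python
import numpy as np


def average_ranks(x):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def residualize_on_length(stat, n_states, transform=np.log):
    """Residuals of `stat` after OLS on an intercept and transform(N)."""
    y = np.asarray(stat, dtype=float)
    x = transform(np.asarray(n_states, dtype=float))
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def corrected_coupling(stat, n_states, difficulty, transform=np.log):
    """Assumed form of the length-corrected difficulty-geometry coupling."""
    residual = residualize_on_length(stat, n_states, transform)
    return spearman(residual, difficulty)
```

Passing `transform=lambda n: n ** -0.5` would give an $N^{-1/2}$ variant of the same sketch. A residual diagnostic in this spirit is to check that the residuals no longer correlate with the length transform.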

Additional checks support the main code-domain sign pattern. Boundary-policy variants indicate that the result is not driven by answer-boundary heuristics (Appendix [D.1](https://arxiv.org/html/2605.15454#A4.SS1)); conditioning on correctness preserves the main reasoning-baseline gap within both correct and incorrect subsets, although correctness composition still covaries with difficulty (Appendix [D.4](https://arxiv.org/html/2605.15454#A4.SS4)); and layer analyses show that the code-domain signal is present across the five sampled layers (Appendix [D.2](https://arxiv.org/html/2605.15454#A4.SS2)). The main remaining limitation is that these analyses identify a robust geometric association, not a causal mechanism. Per-domain length-correction values for MATH and SAT are reported in Appendix [C.1](https://arxiv.org/html/2605.15454#A3.SS1).

## 6 Observable Reasoning Behaviors Co-vary with the Geometric Signal

The probe analyses show that the corrected geometry gap is not simply mirrored by stronger linear difficulty decodability\. We next ask whether the geometric signal has an observable counterpart in the generated reasoning traces\. We focus on Codeforces, where the reasoning\-baseline separation in corrected trajectory geometry is strongest\.

We annotate generated solution segments sentence by sentence using three independent LLM judges. The annotation scheme covers strategy shifting, uncertainty monitoring, self-correction, verification, problem restatement, and subgoal decomposition. Sentence labels are aggregated into per-problem behavior rates over the generated solution segment. Judge identities, prompts, metadata visibility, and agreement statistics are reported in Appendix [F.2](https://arxiv.org/html/2605.15454#A6.SS2); residualized indirect-effect analyses are reported in Appendix [F.3](https://arxiv.org/html/2605.15454#A6.SS3).

The strongest behavioral correlates are strategy shifting and uncertainty monitoring\. In all four R1\-distilled models on Codeforces, both behaviors have positive residualized indirect effects with bootstrap confidence intervals excluding zero\. QwQ\-32B shows the same direction more weakly\. Phi\-4\-Reasoning shows a different profile: verification is the strongest behavioral correlate, while uncertainty monitoring is not positive\. This heterogeneity suggests that the behavioral analysis does not identify a universal reasoning\-model mechanism; instead, it provides an observable correlate of the code\-domain geometric separation\.

These annotations connect the geometric signal to observable trace\-level behavior, but they do not provide a causal explanation\. Behaviors and geometry are measured from the same generated traces, and trace content can itself affect trajectory shape\. The behavioral results are descriptive evidence that corrected trajectory geometry co\-varies with recognizable reasoning dynamics\.

## 7 Discussion

Generation length is a structural variable in generation-time trajectory geometry. Straightness-style path statistics depend on path structure and length, and prior language-model work has shown that trajectory geometry can be informative when the trajectory regime is well specified (Benhamou, [2004](https://arxiv.org/html/2605.15454#bib.bib13); Hosseini and Fedorenko, [2023](https://arxiv.org/html/2605.15454#bib.bib12)). In token-time generation, response length varies with problem difficulty, correctness, and model class, so raw geometric statistics mix trajectory organization with path-length mechanics. Length correction therefore changes the object of analysis: it separates geometry associated with how generation unfolds from geometry induced by how long generation continues.

This length-aware view reveals difficulty-dependent trajectory structure across the domains we study. Corrected geometry retains systematic coupling with item difficulty after the dominant length component is removed, showing that harder problems are not characterized only by longer traces. This is especially relevant for reasoning models, where test-time compute, problem difficulty, response length, and correctness interact in nontrivial ways (Snell et al., [2024](https://arxiv.org/html/2605.15454#bib.bib5); Chen et al., [2025](https://arxiv.org/html/2605.15454#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2605.15454#bib.bib4); Su et al., [2025](https://arxiv.org/html/2605.15454#bib.bib6)). The corrected statistics are not a direct measure of reasoning quality; they are a controlled description of how hidden-state trajectories vary with difficulty during generation.

The reasoning\-specific pattern is strongest in competitive programming\. In that domain, matched reasoning models and instruction\-tuned baselines differ most clearly after length correction, suggesting that reasoning training changes how trajectories adapt as problems become harder\. A plausible explanation is that hard code problems more visibly elicit strategy selection, revision, and verification over extended traces\. Mathematics and Boolean satisfiability still show corrected difficulty–geometry coupling, though the separation between model classes is weaker\. This domain dependence is informative: corrected geometry captures both general difficulty\-conditioned generation and, in the code setting, a sharper reasoning\-training contrast\.

The probe and behavioral analyses refine this interpretation\. Prompt\-stage linear probes do not mirror the code\-domain geometry gap, separating difficulty decodability from difficulty\-conditioned trajectory dynamics\. This makes generation\-time geometry a complementary object of study: it asks not only whether difficulty information is present, but how the trajectory evolves while the model produces a solution\. Behavioral annotations provide an observable counterpart to the geometric signal\. Stronger corrected coupling co\-occurs with strategy shifting and uncertainty monitoring, linking the trajectory statistics to recognizable features of generated reasoning traces\. These analyses are descriptive rather than causal, since the annotations and geometry are derived from the same outputs\.

The broader implication is practical\. Generation\-time representation geometry should be analyzed conditionally on the sampling and segmentation regime\. Comparisons between easy and hard problems, correct and incorrect solutions, or reasoning and non\-reasoning models should report raw and length\-corrected statistics and check residual dependence on length\. Without these controls, apparent differences in trajectory organization can reflect path\-length mechanics rather than differences in how generation unfolds\.

## 8 Conclusion

Generation\-time hidden\-state geometry cannot be interpreted independently of response length\. Across competitive programming, mathematics, and Boolean satisfiability, raw trajectory statistics conflate difficulty\-dependent structure with the mechanical effects of longer generations, while length\-corrected statistics reveal systematic coupling between item difficulty and trajectory geometry\. Within this corrected view, reasoning\-trained models show their clearest separation from matched instruction\-tuned baselines in competitive programming, where harder problems induce more direct trajectories and less heterogeneous local curvature; the weaker separation in the other domains shows that the effect is domain\-dependent rather than a universal signature of reasoning training\. These results make length correction a necessary step for studying representation geometry during generation and suggest that reasoning training can change how internal trajectories adapt to problem difficulty\. Establishing causal control over these trajectories remains an important direction for understanding how reasoning behavior is organized during generation\.

## Limitations

The main limitations concern segmentation, correction choice, and causal interpretation\. Reasoning models with explicit thinking delimiters provide cleaner solution segments than instruction\-tuned baselines, whose boundaries must be inferred from answer markers; baseline comparisons are therefore partly dependent on segmentation policy\. Directness is more sensitive to the length\-correction family than curvature variability, so it should be read together with robustness checks and complementary metrics\. Correctness still covaries with difficulty, even though correctness\-conditioned analyses preserve the main code\-domain pattern\. Behavioral annotations provide observable correlates of the geometric signal, but not a mechanism, since labels and geometry come from the same traces\. Finally, fixed\-stride sampling may miss finer temporal structure, leaving causal and higher\-resolution analyses for future work\.

## Acknowledgments

This work was supported by the Novo Nordisk Foundation grant NNF22OC0076907, “Cognitive spaces – Next generation explainability”, the Pioneer Centre for AI, DNRF grant number P1, and the Danish Data Science Academy, which is funded by the Novo Nordisk Foundation \(NNF21SA0069429\) and VILLUM FONDEN \(40516\)\. Anders Gjølbye conducted part of this work while visiting Stanford University\. Sanmi Koyejo acknowledges support by NSF 2046795 and 2205329, IES R305C240046, ARPA\-H, the MacArthur Foundation, Schmidt Sciences, HAI, OpenAI, Microsoft, and Google\.

## References

- Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations (ICLR), Workshop Track.
- S. Benhamou (2004) How to reliably estimate the tortuosity of an animal’s path: straightness, sinuosity, or fractal dimension? Journal of Theoretical Biology 229(2), pp. 209–220.
- X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu (2025) Do NOT think that much for 2+3=? on the overthinking of long reasoning models. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 9487–9499.
- E. A. Codling, M. J. Plank, and S. Benhamou (2008) Random walk models in biology. Journal of the Royal Society Interface 5(25), pp. 813–834.
- H. Damirchi, I. Meza De la Jara, E. Abbasnejad, A. Shamsi, Z. Zhang, and J. Shi (2026) Truth as a trajectory: what internal representations reveal about large language model reasoning. External Links: 2603.01326.
- M. Ding, C. Deng, J. Choo, Z. Wu, A. Agrawal, A. Schwarzschild, T. Zhou, T. Goldstein, J. Langford, A. Anandkumar, and F. Huang (2024) Easy2Hard-Bench: standardized difficulty labels for profiling LLM performance and generalization. In Advances in Neural Information Processing Systems 37, Datasets and Benchmarks Track.
- E. Facco, M. d’Errico, A. Rodriguez, and A. Laio (2017) Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports 7, pp. 12140.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS).
- J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- E. A. Hosseini and E. Fedorenko (2023) Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. In Advances in Neural Information Processing Systems (NeurIPS).
- Y. Huang, H. Chen, S. Ruan, Y. Zhang, X. Wei, and Y. Dong (2025) Mitigating overthinking in large reasoning models via manifold steering. arXiv preprint arXiv:2505.22411.
- I. T. Jolliffe and J. Cadima (2016) Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A 374(2065), pp. 20150202.
- S. Lee, Q. Yin, C. T. Leong, J. Zhang, Y. Gong, S. Ni, M. Yang, and X. Shen (2025) Probing the difficulty perception mechanism of large language models. arXiv preprint arXiv:2510.05969.
- K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2024) Inference-time intervention: eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS).
- W. Lugoloobi and C. Russell (2025) LLMs encode how difficult problems are. arXiv preprint arXiv:2510.18147.
- F. M. Polo, L. Choshen, W. Sun, et al. (2024) TinyBenchmarks: evaluating LLMs with fewer examples. In International Conference on Machine Learning (ICML).
- G. Rasch (1960) Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
- S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg (2020) Null it out: guarding protected attributes by iterative nullspace projection. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7237–7256.
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
- J. Su, J. Healey, P. Nakov, and C. Cardie (2025) Between underthinking and overthinking: an empirical study of reasoning length and correctness in LLMs. arXiv preprint arXiv:2505.00127.
- L. Sun, H. Dong, B. Qiao, Q. Lin, D. Zhang, and S. Rajmohan (2026) LLM reasoning as trajectories: step-specific representation geometry and correctness signals. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics.
- A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023) Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
- Y. Wang, Q. Liu, J. Xu, T. Liang, X. Chen, Z. He, L. Song, D. Yu, J. Li, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu (2025) Thoughts are all over the place: on the underthinking of long reasoning models. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, Spotlight.
- A. Wei, Y. Wu, Y. Wan, T. Suresh, H. Tan, Z. Zhou, S. Koyejo, K. Wang, and A. Aiken (2025) SATBench: benchmarking LLMs’ logical reasoning via automated puzzle generation from SAT formulas. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33832–33849. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1716).
- Z. Xu, J. Liu, Y. Wang, and Y. Gu (2025) Latency-response theory model: evaluating large language models via response accuracy and chain-of-thought length. arXiv preprint arXiv:2512.07019.
- H. Zhou, H. Huang, Z. Zhao, et al. (2025) Lost in benchmarks? Rethinking large language model benchmarking with item response theory. arXiv preprint arXiv:2505.15055.
- Y. Zhou, Y. Wang, X. Yin, S. Zhou, and A. R. Zhang (2026) The geometry of reasoning: flowing logics in representation space. In International Conference on Learning Representations (ICLR).
- Y. Zhu, D. Liu, Z. Lin, W. Tong, S. Zhong, and J. Shao (2025) The LLM already knows: estimating LLM-perceived question difficulty via hidden representations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1160–1176.

## Appendix A Data and Difficulty Calibration

### A.1 Datasets

We evaluate on 500 Easy2Hard\-Bench competitive\-programming problems, 500 MATH problems, and 500 SATBench problems\. SATBench items are stratified into five clause\-count bins spanning 4–45 clauses and are approximately balanced between satisfiable and unsatisfiable instances within each bin\. No item is shared across domains\. Native difficulty labels are used for external validation of the latent difficulty scale, not as the primary independent variable in trajectory analyses\.

### A.2 Model inventory

Table [2](https://arxiv.org/html/2605.15454#A1.T2) lists all 32 models. The 11 unique models forming the six matched pairs (Qwen2.5-32B-Instruct serves as the shared baseline of two pairs) have hidden-state access; activations are extracted at five evenly spaced layers, with states sampled every 10 generated tokens (Appendix B.2). The remaining 21 models contribute correctness data only and stabilize the IRT difficulty scale. Each per-domain calibration pool contains all 32 models.

Table 2: All models used in this study. Hidden dimension $d$ and layer count $L$ are shown for the matched-pair models. $\theta_{\text{code}}$, $\theta_{\text{math}}$, and $\theta_{\text{sat}}$ are pooled binomial Rasch ability estimates per domain.

| Model | Source | $d$ | $L$ | Role | $\theta_{\text{code}}$ | $\theta_{\text{math}}$ | $\theta_{\text{sat}}$ |
|---|---|---|---|---|---|---|---|
| *Matched pairs (11 models)* | | | | | | | |
| R1-Distill-Qwen-7B | DeepSeek | 3,584 | 28 | Reasoning | −0.99 | +1.57 | +0.24 |
| R1-Distill-Qwen-14B | DeepSeek | 5,120 | 48 | Reasoning | +0.34 | +1.75 | +0.74 |
| R1-Distill-Qwen-32B | DeepSeek | 5,120 | 64 | Reasoning | +0.64 | +1.66 | +1.16 |
| R1-Distill-Llama-8B | DeepSeek | 4,096 | 32 | Reasoning | −0.71 | +1.22 | +0.53 |
| QwQ-32B | Qwen | 5,120 | 64 | Reasoning | +0.82 | +1.87 | +1.21 |
| Phi-4-Reasoning | Microsoft | 5,120 | 40 | Reasoning | +0.34 | +2.99 | +1.38 |
| Qwen2.5-7B-Instruct | Qwen | 3,584 | 28 | Baseline | −3.28 | +1.59 | +0.03 |
| Qwen2.5-14B-Instruct | Qwen | 5,120 | 48 | Baseline | −2.84 | +1.88 | +0.25 |
| Qwen2.5-32B-Instruct | Qwen | 5,120 | 64 | Baseline | −1.89 | +2.04 | +0.31 |
| Llama-3.1-8B-Instruct | Meta | 4,096 | 32 | Baseline | −4.61 | +0.27 | −0.19 |
| Phi-4 | Microsoft | 5,120 | 40 | Baseline | −2.06 | +2.05 | +0.33 |
| *IRT calibration models (21)* | | | | | | | |
| Phi-3.5-Mini-Instruct | Microsoft | – | – | Calibration | −4.59 | −0.43 | +0.02 |
| Gemma-2-9B-IT | Google | – | – | Calibration | −4.16 | −0.19 | −0.41 |
| Mistral-7B-Instruct | Mistral | – | – | Calibration | −5.85 | −3.17 | −0.13 |
| Qwen2.5-Math-7B-Instruct | Qwen | – | – | Calibration | −5.66 | +2.09 | +0.04 |
| DeepSeek-7B-Chat | DeepSeek | – | – | Calibration | −7.82 | −1.94 | −0.98 |
| OLMo-7B-Instruct | AI2 | – | – | Calibration | −7.64 | −0.92 | −0.04 |
| Qwen2-7B-Instruct | Qwen | – | – | Calibration | −5.09 | +0.68 | +0.08 |
| Zephyr-7B-Beta | HuggingFace | – | – | Calibration | −6.86 | −4.52 | −0.14 |
| Mistral-Small-24B | Mistral | – | – | Calibration | −2.86 | +1.84 | +0.24 |
| Claude Haiku 4.5 | Anthropic | – | – | Calibration | +0.70 | +2.81 | +0.82 |
| Claude Sonnet 4 | Anthropic | – | – | Calibration | +0.65 | +3.12 | +0.94 |
| DeepSeek-V3 | DeepSeek | – | – | Calibration | +1.53 | +2.87 | +2.67 |
| Gemini 2.5 Flash Lite | Google | – | – | Calibration | +0.65 | +3.13 | +1.20 |
| Gemini 2.5 Flash | Google | – | – | Calibration | +1.52 | +3.49 | +2.07 |
| Gemini 2.5 Pro | Google | – | – | Calibration | +2.99 | +1.02 | +3.64 |
| Gemma-3-27B | Google | – | – | Calibration | −1.20 | +2.78 | +0.18 |
| GPT-4o-Mini | OpenAI | – | – | Calibration | −1.59 | +1.69 | +0.09 |
| GPT-4o | OpenAI | – | – | Calibration | −2.59 | +1.75 | +0.27 |
| Llama-3.3-70B-Instruct | Meta | – | – | Calibration | −1.42 | +2.28 | +0.23 |
| o4-mini | OpenAI | – | – | Calibration | +2.60 | +2.38 | +2.48 |
| Qwen2.5-72B-Instruct | Qwen | – | – | Calibration | −2.09 | +2.03 | +0.36 |

### A.3 Matched-pair convention

The main analysis uses six matched reasoning\-baseline comparisons\. Because Qwen2\.5\-32B\-Instruct is the shared baseline of both R1\-Distill\-Qwen\-32B and QwQ\-32B, the six comparisons contain five unique baseline models\. Pair\-level summaries count six baseline appearances; unique\-model summaries count five\. We state which convention is used whenever reporting counts\.

### A.4 Correctness evaluation

#### Competitive programming\.

The last Python code block is extracted from each trace and executed against official test cases in sandboxed subprocesses (5 s timeout). Output comparison uses whitespace normalization, float tolerance ($10^{-6}$), and case-insensitive boolean matching.
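The output-comparison rules above could be sketched as below. This is an illustrative reading, not the authors' checker: it assumes whitespace normalization means token-wise comparison after splitting on any whitespace, that the float tolerance is absolute, and that "boolean" tokens are `yes`/`no`/`true`/`false`. The function names are hypothetical.

```python
# Assumption: these are the tokens treated as booleans.
BOOLEAN_TOKENS = {"yes", "no", "true", "false"}


def tokens_match(expected: str, actual: str, tol: float = 1e-6) -> bool:
    """Compare two whitespace-free tokens under the assumed rules."""
    if expected == actual:
        return True
    # Case-insensitive matching applies to boolean-like tokens only.
    if expected.lower() in BOOLEAN_TOKENS:
        return expected.lower() == actual.lower()
    try:
        expected_value, actual_value = float(expected), float(actual)
    except ValueError:
        return False
    # Assumption: absolute (not relative) tolerance.
    return abs(expected_value - actual_value) <= tol


def outputs_match(expected: str, actual: str, tol: float = 1e-6) -> bool:
    """Token-wise comparison after collapsing all whitespace."""
    expected_tokens, actual_tokens = expected.split(), actual.split()
    if len(expected_tokens) != len(actual_tokens):
        return False
    return all(
        tokens_match(e, a, tol) for e, a in zip(expected_tokens, actual_tokens)
    )
```

Under these assumptions, `outputs_match("YES\n1 2 3\n", "yes \n1  2 3")` is accepted, while a non-boolean string such as `Alice` still has to match exactly.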

#### Mathematics\.

The last `\boxed{}` expression is extracted (handling nested braces). Comparison uses exact string matching, SymPy symbolic equivalence, and numerical tolerance.
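Extraction of the last boxed expression with nested braces can be sketched with a simple brace counter. This is a hypothetical helper, not the authors' code; it assumes braces inside the answer are balanced and does not treat escaped braces (`\{`, `\}`) specially.

```python
def last_boxed(text: str):
    """Return the contents of the last \\boxed{...}, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    contents = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(contents)
        contents.append(ch)
    return None  # Unbalanced braces: no complete boxed expression.
```

The returned string would then be passed to the comparison stage (exact match, symbolic equivalence, numerical tolerance), which is not shown here.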

#### Boolean satisfiability\.

The first `SATISFIABLE` or `UNSATISFIABLE` token in the trace is extracted (case-insensitive, with optional surrounding markup) and compared against the ground-truth label. Traces with no detected marker are treated as incorrect.
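One subtlety is that `UNSATISFIABLE` contains `SATISFIABLE` as a substring, so a naive search can misread the verdict. A minimal sketch that avoids this, with a hypothetical regular expression and function names (the authors' exact pattern is not given):

```python
import re

# The lookbehind stops "SATISFIABLE" from matching inside "UNSATISFIABLE";
# non-letter markup such as "**" around the token is allowed.
_MARKER = re.compile(
    r"(?<![A-Za-z])(UNSATISFIABLE|SATISFIABLE)(?![A-Za-z])", re.IGNORECASE
)


def first_sat_label(trace: str):
    """Return the first verdict token in upper case, or None if absent."""
    match = _MARKER.search(trace)
    return match.group(1).upper() if match else None


def is_correct(trace: str, truth: str) -> bool:
    """Traces with no detected marker count as incorrect."""
    label = first_sat_label(trace)
    return label is not None and label == truth.upper()
```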

### A.5 Prompt templates and decoding

All models are sampled with temperature $0.6$, nucleus $p=0.95$, a maximum of 32,768 tokens, and a fixed seed per run. R1-distilled models receive a `<think>` prefix to trigger extended reasoning; Phi-4-Reasoning uses its native reasoning prompt format; instruction-tuned baselines receive identical problem statements without thinking delimiters. The main analysis uses five runs per problem per model, with 30 runs for R1-Distill-Qwen-7B in the run-count stability analysis (Appendix [D.5](https://arxiv.org/html/2605.15454#A4.SS5)).

### A.6 Rasch calibration

For item $i$ and model $j$, observed successes are modeled as

$$k_{ij}\sim\mathrm{Binomial}\!\left(n_{ij},\sigma(\theta_{j}-b_{i})\right),$$

where $\theta_{j}$ is model ability and $b_{i}$ is item difficulty. We use the fitted $b_{i}$ as the continuous difficulty variable in all downstream analyses.
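For concreteness, the likelihood implied by this model can be written in a few lines of NumPy. The sketch below is not the authors' PyTorch MAP code: it omits the prior, drops the constant binomial coefficient, assumes $\sigma$ is the logistic function, and uses hypothetical function names. It gives only the negative log-likelihood and its gradients, which any gradient-based optimizer could consume.

```python
import numpy as np


def rasch_nll(theta, b, k, n):
    """Negative log-likelihood of the binomial Rasch model.

    theta: (n_models,) abilities; b: (n_items,) difficulties;
    k, n: (n_items, n_models) success counts and run counts.
    """
    theta, b = np.asarray(theta, float), np.asarray(b, float)
    k, n = np.asarray(k, float), np.asarray(n, float)
    logits = theta[None, :] - b[:, None]
    log_p = -np.logaddexp(0.0, -logits)  # log sigma(z), numerically stable
    log_q = -np.logaddexp(0.0, logits)   # log (1 - sigma(z))
    return float(-np.sum(k * log_p + (n - k) * log_q))


def rasch_nll_grad(theta, b, k, n):
    """Gradients of the negative log-likelihood w.r.t. theta and b."""
    theta, b = np.asarray(theta, float), np.asarray(b, float)
    k, n = np.asarray(k, float), np.asarray(n, float)
    p = 1.0 / (1.0 + np.exp(-(theta[None, :] - b[:, None])))
    surplus = k - n * p                  # observed minus expected successes
    return -surplus.sum(axis=0), surplus.sum(axis=1)
```

The gradient has a direct reading: an item's difficulty estimate moves down when models succeed on it more often than the current parameters predict, and up otherwise.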

#### Overview\.

We validate pooled binomial Rasch calibration along four axes: boundary\-item structure and item\-pool coverage, agreement with withheld native difficulty labels, parsimony relative to 2PL, and leave\-one\-out \(LOO\) stability under removal of any single calibration model\.

Table 3: Calibration pool and optimization summary.

#### Optimization behavior\.

MAP estimation uses Adam with PyTorch defaults ($\beta_{1}=0.9$, $\beta_{2}=0.999$); full settings are in Table [3](https://arxiv.org/html/2605.15454#A1.T3). Across the three domains, optimization runs for 1,574–3,104 epochs before the loss plateaus. This slow-tail convergence is expected for Rasch MAP fitting and is benign for downstream inference because subsequent analyses use rank-level properties of $b_{i}$, which are highly stable under recalibration.

#### Boundary structure and coverage\.

Boundary items are solved on every run by every model \(ceiling\) or failed on every run by every model \(floor\)\. In code, 457 items are informative and 43 are floor\-boundary; in math, 470 are informative and 30 are floor\-boundary; in SAT, all 500 items are informative with neither floor nor ceiling boundary mass, reflecting that no SATBench instance is solved by every model or failed by every model in the 32\-model calibration pool\. All three domains have zero ceiling\-boundary items\.
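The boundary definition above translates directly into a check on pooled success counts. A minimal sketch, assuming per-item, per-model count matrices with at least one run per item (the function name and array layout are hypothetical):

```python
import numpy as np


def classify_items(k, n):
    """Label each item as 'floor', 'ceiling', or 'informative'.

    k, n: (n_items, n_models) success counts and run counts.
    Assumes every item has at least one run in total.
    """
    k, n = np.asarray(k), np.asarray(n)
    successes = k.sum(axis=1)
    runs = n.sum(axis=1)
    labels = np.full(k.shape[0], "informative", dtype=object)
    labels[successes == 0] = "floor"       # failed on every run by every model
    labels[successes == runs] = "ceiling"  # solved on every run by every model
    return labels
```

Counting the labels per domain would reproduce a summary of the kind reported here, for example 457 informative and 43 floor items in code.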

![Refer to caption](https://arxiv.org/html/2605.15454v1/x4.png)

Figure 4: Boundary structure of pooled 1PL item difficulties. Left: code domain. Center: math domain. Right: SAT domain. Code and math show floor-only boundary mass (no ceiling items), with boundary difficulties fixed by the prior-offset convention described in Section [3](https://arxiv.org/html/2605.15454#S3). SAT has no boundary items in the 32-model calibration pool: all 500 SATBench instances are informative.

#### External validity\.

Against withheld native labels, code-domain difficulty aligns with Codeforces Glicko-2 ratings (Pearson $r=0.520$, Spearman $\rho=0.552$, $n=500$). Math-domain difficulty aligns with MATH levels (Spearman $\rho=0.435$); cross-level differences are strong (Kruskal–Wallis $H=100.27$, 4 d.f., $p<10^{-20}$). SAT-domain difficulty aligns with SAT clause counts (Pearson $r=0.583$, Spearman $\rho=0.560$, $n=500$, $p<10^{-42}$), the strongest external alignment of the three domains.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x5.png)

Figure 5: External validation using labels not used during IRT fitting. (a) Code: IRT difficulty versus Codeforces Glicko-2 rating. (b) Math: IRT difficulty stratified by MATH levels L1–L5. (c) SAT: IRT difficulty versus SAT clause count (Pearson $r=0.583$, Spearman $\rho=0.560$, $n=500$).

Table 4: Compact validation summary for pooled IRT calibration.

#### 1PL parsimony and LOO stability\.

Allowing free item discriminations in 2PL leaves the difficulty ordering nearly unchanged ($\rho(1\mathrm{PL},2\mathrm{PL})=0.9893$ code, $0.9716$ math, $0.9371$ SAT), supporting 1PL as the parsimonious choice. LOO recalibration over models yields near-identical rankings relative to the full-pool fit (median Spearman $\rho$ of $0.9997$ on code and math and $0.9915$ on SAT; minimum $\rho$ of $0.9906$ code, $0.9983$ math, $0.9858$ SAT), indicating that no single model drives the scale.

External agreement is moderate rather than near-perfect ($\rho\approx 0.43$ to $0.56$), as expected because Codeforces ratings reflect contest-population dynamics, MATH levels are coarse ordinals, and SAT clause counts measure structural rather than algorithmic difficulty. The resulting latent variable is therefore interpreted as model-pooled difficulty, with strong internal stability and meaningful but not identity-level alignment to native labels.

### A.7 Compute and storage

Local runs for the larger open\-weight models used three NVIDIA H100 GPUs with 80 GB of VRAM each; smaller open\-weight models were run on NVIDIA L40S GPUs\. The main cost was not ordinary decoding, but decoding while saving hidden states, which slowed generation substantially\. Across the full set of model families, domains, and robustness checks, local generation and activation extraction took several weeks to months of wall\-clock time\. The stored artifacts occupy approximately 3 TB \(see §[3](https://arxiv.org/html/2605.15454#S3)\); the bulk is extracted hidden states and intermediate trajectory representations\. API calls were used for non\-local calibration models only; hidden\-state trajectories were extracted exclusively from locally run open\-weight models\.

## Appendix B Trajectories and Metrics

### B.1 Hidden-state tensor

For each selected decoder layer $\ell$, we extract the post-block residual-stream output at generation step $t$, evaluated at the final generated-token position. During generation, this yields a trajectory

$$\mathbf{h}_{imr,0}^{(\ell)},\mathbf{h}_{imr,1}^{(\ell)},\dots,\mathbf{h}_{imr,T_{imr}}^{(\ell)}\in\mathbb{R}^{d}.$$

This corresponds to the direct output of the decoder layer module after the full attention and feed-forward block with residual connections, rather than pre-layernorm, attention-only, or FFN-only activations.

### B.2 Sampling, layers, and stride

Hidden states are extracted at five evenly spaced layers, with indices $\{\lfloor i\cdot(L-1)/4\rfloor : i=0,\ldots,4\}$, where $L$ is the total number of layers. States are captured every 10 generated tokens. Trajectory-geometry statistics report the median across the five sampled layers, while probe scores (Appendix [E](https://arxiv.org/html/2605.15454#A5)) use the peak (layer $\times$ position) cell selected independently of the trajectory metric. Layer-specific values for $\rho_{\perp}^{D}$ are reported in Appendix [D.2](https://arxiv.org/html/2605.15454#A4.SS2).
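As a worked instance of the layer-index formula, the short sketch below computes the sampled layers and token steps. The function names are hypothetical, and starting the token stride at step 0 is an assumption; the source states only that states are captured every 10 generated tokens.

```python
def sampled_layers(num_layers: int, n_samples: int = 5):
    """Indices floor(i * (L - 1) / (n_samples - 1)) for i = 0..n_samples-1."""
    return [(i * (num_layers - 1)) // (n_samples - 1) for i in range(n_samples)]


def sampled_token_steps(n_generated: int, stride: int = 10):
    """Generation steps at which a state is kept (assumes sampling starts at 0)."""
    return list(range(0, n_generated, stride))
```

For a 28-layer model this gives layers 0, 6, 13, 20, and 27; for a 64-layer model it gives 0, 15, 31, 47, and 63.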

### B.3 Generated solution segments

For models with explicit reasoning delimiters, the closing delimiter defines the boundary between reasoning and answer phases. For non-R1 models, the boundary is detected heuristically: the first code fence for competitive programming, `\boxed{}` for mathematics, the first `SATISFIABLE`/`UNSATISFIABLE` marker for Boolean satisfiability, and XML answer tags as a fallback. If no boundary pattern is detected, the full output is treated as reasoning for baseline models and as answer-only for tagged reasoning models. All trajectory metrics are computed on the generated solution segment.
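The heuristic boundary detection for baseline models might look like the following text-level sketch. Several details are assumptions rather than the authors' policy: the `<answer>` tag name is hypothetical (the source says only "XML answer tags"), the first `\boxed{` occurrence is used for mathematics, and the boundary is returned as a character offset that would still need to be mapped to a token index.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this block intact.

_PATTERNS = {
    "code": re.compile(re.escape(FENCE)),
    "math": re.compile(r"\\boxed\{"),
    "sat": re.compile(
        r"(?<![A-Za-z])(?:UN)?SATISFIABLE(?![A-Za-z])", re.IGNORECASE
    ),
}
# Hypothetical fallback tag name.
_FALLBACK = re.compile(r"<answer>", re.IGNORECASE)


def reasoning_boundary(text: str, domain: str):
    """Character offset where the answer phase begins, or None."""
    for pattern in (_PATTERNS[domain], _FALLBACK):
        match = pattern.search(text)
        if match:
            return match.start()
    return None


def reasoning_segment(text: str, domain: str) -> str:
    """Baseline policy: with no detected boundary, keep the full output."""
    cut = reasoning_boundary(text, domain)
    return text if cut is None else text[:cut]
```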

Segmentation is exact for tagged reasoning models and heuristic for instruction-tuned baselines. This asymmetry is a genuine limitation of current open-model formats. Boundary-detection rates, fallback rates, and boundary-policy sensitivity checks are reported in Appendix [D.1](https://arxiv.org/html/2605.15454#A4.SS1).

### B.4 Trajectory metrics

#### Directness.

For trajectory $(\mathbf{h}_{imr,0}^{(\ell)},\dots,\mathbf{h}_{imr,T_{imr}}^{(\ell)})$, define path length and net displacement

$$L_{imr}^{(\ell)}:=\sum_{t=1}^{T_{imr}}\left\|\mathbf{h}_{imr,t}^{(\ell)}-\mathbf{h}_{imr,t-1}^{(\ell)}\right\|_{2},\quad\Delta_{imr}^{(\ell)}:=\left\|\mathbf{h}_{imr,T_{imr}}^{(\ell)}-\mathbf{h}_{imr,0}^{(\ell)}\right\|_{2}.$$

Directness is $D_{imr}^{(\ell)}=\Delta_{imr}^{(\ell)}/L_{imr}^{(\ell)}\in[0,1]$. Runs with fewer than two sampled states cannot define directness and are excluded from directness analyses.
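A minimal NumPy sketch of the directness ratio follows, assuming the sampled states of one run at one layer are stacked into an array of shape `(T + 1, d)`. Returning NaN for a trajectory that never moves is an assumption; the text only specifies the exclusion of runs with fewer than two states.

```python
import numpy as np


def directness(states: np.ndarray) -> float:
    """Net displacement divided by path length for sampled states of shape (T + 1, d)."""
    if states.shape[0] < 2:
        raise ValueError("directness needs at least two sampled states")
    steps = np.diff(states, axis=0)                    # h_t - h_{t-1}
    path_length = np.linalg.norm(steps, axis=1).sum()  # L
    displacement = np.linalg.norm(states[-1] - states[0])  # Delta
    if path_length == 0:
        return float("nan")  # assumption: a degenerate, motionless trajectory
    return float(displacement / path_length)
```

A straight-line trajectory gives 1, and a trajectory that returns to its starting point gives 0.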

#### Curvature variability.

For three consecutive points $A,B,C\in\mathbb{R}^{d}$ along a trajectory, the Menger curvature is

$$\kappa(A,B,C)=\frac{4\cdot\mathrm{Area}(\triangle ABC)}{|AB|\cdot|BC|\cdot|AC|},\tag{4}$$

where the triangle area in $\mathbb{R}^{d}$ is

$$\mathrm{Area}(\triangle ABC)=\tfrac{1}{2}\sqrt{\|\mathbf{u}\|^{2}\|\mathbf{v}\|^{2}-(\mathbf{u}^{\top}\mathbf{v})^{2}},\quad\mathbf{u}=B-A,\;\mathbf{v}=C-A.$$

For trajectory $\bigl(\mathbf{h}_{imr,t}^{(\ell)}\bigr)_{t=0}^{T_{imr}}$, curvature variability is

$$V_{imr}^{(\ell)}:=\operatorname{sd}\!\left(\kappa_{imr,1}^{(\ell)},\dots,\kappa_{imr,T_{imr}-1}^{(\ell)}\right).$$

Curvature variability is defined only for trajectories with at least three sampled states. It measures heterogeneity in local bending rather than total turning. Negative $\rho_{\perp}^{V}$ indicates less heterogeneous local bending after length adjustment, not necessarily less total turning.
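The curvature definitions translate directly into NumPy. In this sketch the use of the population standard deviation (`ddof=0`) and the value 0 for coincident points are assumptions, since the text does not specify either convention.

```python
import numpy as np


def menger_curvature(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """4 * Area(ABC) / (|AB| |BC| |AC|), with the area from the Gram determinant."""
    u, v = b - a, c - a
    gram = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
    area = 0.5 * np.sqrt(max(gram, 0.0))  # clamp tiny negative rounding error
    denom = np.linalg.norm(b - a) * np.linalg.norm(c - b) * np.linalg.norm(c - a)
    return 0.0 if denom == 0 else float(4.0 * area / denom)  # assumption for coincident points


def curvature_variability(states: np.ndarray, ddof: int = 0) -> float:
    """Standard deviation of Menger curvature over consecutive triples of states."""
    if states.shape[0] < 3:
        raise ValueError("curvature variability needs at least three sampled states")
    kappas = [
        menger_curvature(states[t - 1], states[t], states[t + 1])
        for t in range(1, states.shape[0] - 1)
    ]
    return float(np.std(kappas, ddof=ddof))
```

Three points on a circle of radius $R$ yield curvature $1/R$, so evenly spaced points on a circle have zero curvature variability even though the path turns continuously. This is the distinction between heterogeneous bending and total turning.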

#### TwoNN intrinsic dimension.

We estimate intrinsic dimension from the ratio of the two nearest-neighbor distances of each sampled state [Facco et al., [2017](https://arxiv.org/html/2605.15454#bib.bib33)]. Let $r_{i,1}$ and $r_{i,2}$ be the distances from state $i$ to its first and second nearest neighbors among the trajectory's sampled states. Then

$$\mu_{i}=\frac{r_{i,2}}{r_{i,1}},\qquad\widehat{d}_{\mathrm{TwoNN}}=\left(\frac{1}{n}\sum_{i=1}^{n}\log\mu_{i}\right)^{-1},$$

where $n$ is the number of sampled states with a nonzero first-nearest-neighbor distance. Sampled states with duplicate nearest neighbors ($r_{i,1}=0$) are excluded, and nearest neighbors are computed within the sampled states of the same trajectory.
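The estimator above can be sketched as follows. The row-by-row distance loop is a design choice of this illustration (it keeps memory at $O(nd)$ and makes duplicate states produce exact zeros); it is not taken from the paper.

```python
import numpy as np


def twonn_dimension(states: np.ndarray) -> float:
    """TwoNN estimate: inverse of the mean log ratio of 2nd to 1st neighbor distance."""
    n_states = states.shape[0]
    if n_states < 3:
        raise ValueError("TwoNN needs at least three sampled states")
    log_mu = []
    for i in range(n_states):
        dist = np.linalg.norm(states - states[i], axis=1)
        dist = np.delete(dist, i)             # drop the distance to itself
        r1, r2 = np.partition(dist, 1)[:2]    # two smallest distances, in order
        if r1 > 0:                            # exclude duplicated states
            log_mu.append(np.log(r2 / r1))
    return float(1.0 / np.mean(log_mu))
```

The estimate is undefined when every retained ratio equals 1 (for example, perfectly evenly spaced points on a line), because the mean log ratio is then zero.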

#### PCA90 dimensionality.

PCA90 reports the smallest number of principal components capturing 90% of the variance of the trajectory's sampled states [Jolliffe and Cadima, [2016](https://arxiv.org/html/2605.15454#bib.bib34)]. Let $Z\in\mathbb{R}^{T\times d}$ be the matrix of sampled trajectory states after centering across the $T$ sampled time points, and let $\lambda_{1}\geq\cdots\geq\lambda_{r}$ be the nonzero eigenvalues of the sample covariance, with $r\leq\min(T-1,d)$. Then

$$\mathrm{PCA90}(Z)=\min\!\left\{k:\frac{\sum_{j=1}^{k}\lambda_{j}}{\sum_{j=1}^{r}\lambda_{j}}\geq 0.90\right\}.$$

Length-residualized difficulty couplings for TwoNN and PCA90 are reported in Appendix [C.2](https://arxiv.org/html/2605.15454#A3.SS2).
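A compact way to compute PCA90 is through the singular values of the centered state matrix, whose squares are proportional to the covariance eigenvalues. The sketch below assumes the trajectory has nonzero variance.

```python
import numpy as np


def pca90(states: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest k whose leading principal components reach `threshold` of the variance."""
    centered = states - states.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2  # proportional to covariance eigenvalues
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First index where the cumulative ratio is >= threshold, converted to a count.
    return int(np.searchsorted(cumulative, threshold) + 1)
```

Because only ratios of eigenvalues enter the definition, the $1/(T-1)$ normalization of the sample covariance can be skipped.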

## Appendix C Length Dependence and Auxiliary Geometry

### C.1 Alternative length corrections

The primary correction residualizes each trajectory statistic on $\log N$ separately within domain, model, and sampled layer; the residuals are then correlated with IRT difficulty $b_{i}$ at the item level. We compare this estimator against $N^{-1/2}$ residualization, log-log residualization, and length-binned matching to assess how strongly the reasoning-baseline contrast depends on the functional form of the length correction.
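The primary correction can be illustrated for a single (domain, model, layer) cell with one row per item. This is a sketch under stated assumptions: how runs are aggregated to the item level is not shown, and the helper names are hypothetical.

```python
import numpy as np


def average_ranks(values) -> np.ndarray:
    """1-based ranks in which tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y) -> float:
    """Spearman correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def residualize_on_log_length(stat, n_tokens) -> np.ndarray:
    """Residuals of an OLS fit stat ~ 1 + log(N)."""
    stat = np.asarray(stat, dtype=float)
    design = np.column_stack([np.ones(len(stat)), np.log(n_tokens)])
    coef, *_ = np.linalg.lstsq(design, stat, rcond=None)
    return stat - design @ coef


def corrected_coupling(stat, n_tokens, difficulty) -> float:
    """Length-corrected coupling: Spearman of the residuals with item difficulty."""
    return spearman(residualize_on_log_length(stat, n_tokens), difficulty)
```

Swapping `np.log(n_tokens)` for `np.asarray(n_tokens) ** -0.5` in the design matrix would give the $N^{-1/2}$ variant with no other changes.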

Under the primary $\log N$ correction, Codeforces $\rho_{\perp}^{D}$ for reasoning models is positive for all six matched pairs (median $+0.41$), while matched baselines center near zero or negative (median $-0.06$). The complementary $\rho_{\perp}^{V}$ separation is similarly clean and opposite in sign (reasoning median $-0.50$, baseline $+0.05$). On MATH, $\rho_{\perp}^{D}$ is mostly near zero after residualization, while $\rho_{\perp}^{V}$ shows small negative shifts for most reasoning models. The code-domain reasoning signal is therefore not a raw-length artifact, although its estimated strength depends on correction family for $\rho_{\perp}^{D}$.

Table 5: Cross-method consistency: Spearman agreement with the primary $\log N$ correction.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x6.png)

Figure 6: Length-correction robustness across four correction families for $\rho_{\perp}^{D}$ and $\rho_{\perp}^{V}$. Each point shows one model under one correction family. The primary $\log N$ and $N^{-1/2}$ corrections agree closely, while directness is less stable under log-log and binned corrections. Curvature variability is more stable across correction families.

The $\log N$ and $N^{-1/2}$ corrections agree closely for both $\rho_{\perp}^{D}$ and $\rho_{\perp}^{V}$ (Spearman $+0.96$ each). Agreement is much weaker for directness under log-log and binned corrections ($-0.003$ and $-0.62$), while curvature variability remains more stable across correction families (Codeforces reasoning medians $-0.50$, $-0.46$, $-0.25$, $-0.02$ across the four methods). Codeforces residual diagnostics under the primary $\log N$ correction (Figure [7](https://arxiv.org/html/2605.15454#A3.F7)) nonetheless show that curvature variability retains moderate residual length association. We use the $\log N$ correction as the primary estimator because it is a simple monotone adjustment for generation length, applies uniformly across all trajectory statistics, and agrees closely with the $N^{-1/2}$ correction. Under this primary correction, Codeforces gives the clearest reasoning-baseline separation.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x7.png)

Figure 7: Codeforces residual length diagnostics under the primary $\log N$ correction. Each panel plots item-level residuals from the primary length-only regression against $\log N$ for one trajectory statistic. The reported Spearman correlations measure remaining monotonic association with generation length after residualization. Directness, TwoNN, and PCA90 show little residual length association, while curvature variability retains a moderate negative association, indicating that curvature should be interpreted as correction-family-stable rather than fully length-independent.

### C.2 Auxiliary dimensionality descriptors

As auxiliary descriptors, we apply the same raw-versus-corrected correlation analysis from Figure [2](https://arxiv.org/html/2605.15454#S4.F2) to TwoNN intrinsic dimension and PCA90 dimensionality. For each model and domain, we compute Spearman correlations with pooled-IRT difficulty before and after residualizing each metric on $\log N$, with 95% percentile bootstrap confidence intervals (1000 resamples).

The resulting pattern is asymmetric across metrics. TwoNN on Codeforces partially reproduces the directness sign-reversal structure: reasoning rows move from negative raw correlations to near-zero or weakly positive corrected values (median $-0.41\rightarrow+0.03$), while baselines change little (median $-0.01\rightarrow+0.07$). PCA90 shows a strong positive raw association with difficulty, consistent with a positive length confound; after correction, code-domain reasoning rows remain only modestly positive (median $+0.65\rightarrow+0.18$), while baselines center near zero or slightly negative ($+0.24\rightarrow-0.03$). Math-domain values are mixed across both groups. On SAT, TwoNN raw correlations are near zero for both groups (reasoning median $+0.05$, baseline $+0.09$) and corrected values rise modestly (reasoning $+0.11$, baseline $+0.17$); PCA90 reverses sign relative to code (raw: reasoning $+0.58$, baseline $+0.07$; corrected: reasoning $0.00$, baseline $-0.26$).

![Refer to caption](https://arxiv.org/html/2605.15454v1/x8.png)

Figure 8: Length correction applied to two intrinsic-dimensionality metrics of hidden-state trajectories. Same dumbbell idiom as Figure [2](https://arxiv.org/html/2605.15454#S4.F2): each row is one model, hollow circles show raw Spearman $\rho$ with pooled-IRT difficulty, and filled squares show length-corrected $\rho_{\perp}$ after residualization on $\log N$. Whiskers are 95% bootstrap confidence intervals (1000 resamples). Panels (a–c) plot TwoNN dimensionality on Codeforces, MATH, and SAT; panels (d–f) plot PCA90 dimensionality on the same three domains. On code, TwoNN shows an attenuated reasoning-model sign-reversal pattern, while PCA90 remains positive but strongly shrunk after correction (reasoning median $+0.18$ vs baseline $-0.03$). Math panels are weaker and mixed. On SAT, TwoNN does not reproduce the directness reversal and PCA90 reverses sign relative to code.

These dimensionality descriptors show that length affects more than the directness ratio, but their corrected patterns are weaker and less aligned with the reasoning-baseline contrast than directness and curvature variability.

## Appendix D Robustness Checks

### D.1 Boundary policy and segmentation

Because reasoning models often expose explicit delimiters while baselines do not, we evaluate whether boundary policy could mechanically induce the observed coupling pattern\.

Table 6: Boundary-detection rates by model tier and domain. Values are fractions of traces using each boundary source.

Table 7: Sensitivity of $\rho$ to boundary policy. Representative boundary-cut values at median $\tau$.

![Refer to caption](https://arxiv.org/html/2605.15454v1/x9.png)

Figure 9: Prefix and boundary sensitivity for directness-difficulty coupling. Prefix curves recompute corrected coupling using the first fraction of each generated solution segment; boundary variants compare full-output trajectories with answer-boundary cuts. Codeforces reasoning models show positive corrected coupling from the earliest measured prefixes, while MATH shows weaker and more gradual emergence. Boundary variants preserve the main Codeforces reasoning-baseline separation.

Across tiers, API models do not emit `<think>` tags; API-code boundaries therefore rely on code fences, XML, or answer-style XML tags (Claude Haiku 4.5 XML-tag rate on Codeforces: $0.483$). Aggregate fallback rates are $0.035$ (API), $0.101$ (local), and $0.026$ (pipeline). Codeforces reasoning models consistently show raw-to-residual sign flips (DeepSeek-R1-32B: $-0.755\to+0.404$; QwQ-32B: $-0.704\to+0.411$); on math, residualized values are mostly small. Tagged-empty incidence is exactly zero across all 12 reasoning-model $\times$ domain cases. Boundary-policy effects are also small for reasoning models in absolute terms ($\Delta_{\mathrm{full}}=|\rho_{\text{boundary}}-\rho_{\tau=1}|$: mean $0.009$, max $0.028$) and larger for baselines (mean $0.090$, max $0.221$).

Boundary-policy choice is therefore not the primary driver of the core code-domain result. Residualized coupling is substantially more stable than raw correlations across boundary variants. The only outlier is Phi-4-Reasoning on math under fixed-prefix variants ($\rho_{\perp,\log T}\approx-0.59$ to $-0.60$), which sits alongside the broader cross-model pattern rather than disturbing it.

### D.2 Layer sensitivity

Table 8: Layer-stratified $\rho_{\perp}^{D}$ across the five sampled layers. Median, minimum, and maximum are taken over each model's five layer-specific values, then aggregated as a median over models within each domain $\times$ group cell. Counts use the pair-level convention.

The qualitative Codeforces reasoning-baseline separation is present at every sampled depth: reasoning models show positive $\rho_{\perp}^{D}$ across all five sampled layers (median $+0.41$), while baselines remain near zero (median $-0.05$). On MATH, the effect is weak across layers in both groups. On SAT, the layer-stratified medians are similar between groups ($+0.27$ vs $+0.23$), reproducing the attenuated reasoning-specificity reported in the main text. Main figures report the median across layers; the qualitative conclusions do not depend on a single sampled layer.

### D.3 Null-label checks

As a null\-label check, item difficulties are shuffled within each domain and model before recomputing corrected difficulty\-geometry coupling\. The Codeforces reasoning\-model couplings lie in the tails of their null distributions, while the SAT pattern appears in both reasoning and baseline groups, matching the main analysis\. This check shows that the observed rank associations are not typical of arbitrary difficulty assignments\.
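The null-label check amounts to a permutation test on the corrected coupling. The sketch below assumes the residuals are already length-corrected; the number of shuffles, the two-sided p-value, and the ordinal tie handling are illustrative choices that the text does not specify.

```python
import numpy as np


def ordinal_ranks(values) -> np.ndarray:
    """Ordinal ranks; ties are broken by position, which is a simplification."""
    return np.argsort(np.argsort(values, kind="stable"), kind="stable").astype(float)


def spearman(x, y) -> float:
    return float(np.corrcoef(ordinal_ranks(x), ordinal_ranks(y))[0, 1])


def null_label_check(residuals, difficulty, n_shuffles: int = 1000, seed: int = 0):
    """Compare the observed coupling with couplings under shuffled difficulty labels."""
    rng = np.random.default_rng(seed)
    observed = spearman(residuals, difficulty)
    null = np.array(
        [spearman(residuals, rng.permutation(difficulty)) for _ in range(n_shuffles)]
    )
    # Two-sided permutation p-value with the usual +1 correction.
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_shuffles)
    return observed, null, float(p_value)
```

An observed coupling "in the tails of the null distribution" corresponds to a small `p_value` here.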

### D.4 Conditioning on correctness

Table 9: Length-corrected directness-difficulty coupling within correctness strata. Group medians of $\rho_{\perp}^{D}$ over pair-level rows. Strata-size ranges are item counts retained per model after correctness filtering.

Conditioning on correctness does not remove the main code-domain pattern. On Codeforces, reasoning models retain positive corrected directness coupling among correct traces and a smaller positive coupling among incorrect traces, while baselines remain near zero in both subsets. SAT shows the same attenuated pattern as the full analysis: reasoning models remain above baselines within each correctness subset, but baselines also show positive corrected coupling. MATH remains small in both groups. Because correctness still covaries with difficulty within these subsets, this check complements rather than replaces the full-domain estimates.

### D.5 Truncation and run-count stability

Truncation rates are low across pipeline tiers (Appendix [D.1](https://arxiv.org/html/2605.15454#A4.SS1)). Removing truncated runs from the trajectory analyses preserves the qualitative reasoning-baseline separation reported in the main text.

R1-Distill-Qwen-7B on code uses 30 runs per problem. We compare trajectory statistics computed from 5 randomly sampled runs versus all 30 runs across 100 bootstrap resamples. The 30-run estimate is $\rho_{\perp}=+0.51$; the 5-run subsample mean is $+0.44$ (95% CI: $[+0.37,+0.49]$). The intraclass correlation coefficient for problem-level mean directness is $\mathrm{ICC}(1,1)=0.80$, indicating that approximately 80% of directness variance is between-problem and that 5 runs per problem yield reasonably stable estimates.

## Appendix E Probes and Interventions

### E.1 Linear difficulty decodability

We probe hidden states for linear difficulty decodability at two stages: at the final prompt token, before generation begins, and across a layer\-by\-position grid that spans the generated solution segment\. These probes test whether the corrected geometry gap is mirrored by stronger linear access to difficulty; they do not test for nonlinear representations of difficulty, nor for differences in how equally accessible information is used during generation\.

Prompt-stage probing extracts the hidden state at the last prompt token at every transformer layer for each of the eleven matched-pair models; the extraction reuses the forward pass that begins generation and keeps only the prompt-token states. Generation-stage probing samples each trace at ten evenly spaced positions (always including the first and last) and at every sampled layer. Hidden states are averaged across runs so that the effective sample size equals the number of unique problems, and trace length is residualized out of both the feature matrix and the difficulty target via OLS (Section [4](https://arxiv.org/html/2605.15454#S4)) before probing.

At each layer for the prompt stage and each (layer, position) cell for the generation stage, we standardize features and fit a Ridge probe [Alain and Bengio, [2017](https://arxiv.org/html/2605.15454#bib.bib36)]. RidgeCV selects $\lambda\in\{10^{-2},10^{-1},1,10,10^{2},10^{3},10^{4}\}$ by leave-one-out cross-validation, and generalization is estimated by 5-fold cross-validated $R^{2}$. A surface-feature floor uses the same Ridge probe on five descriptors of the input prompt (character length, word count, unique-token ratio, numeric literal count, sentence count) and gives $R^{2}\approx 0.04$ on code and $0.08$ on math [Hewitt and Liang, [2019](https://arxiv.org/html/2605.15454#bib.bib37)]. A permutation null shuffles difficulty labels 100 times per heatmap cell and reuses the precomputed decomposition of $\mathbf{X}_{\mathrm{train}}$ so that the procedure scales without refitting.
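To make the probing step concrete, here is a simplified NumPy sketch of a standardized ridge probe scored by 5-fold cross-validated $R^{2}$ at one cell. It is not the RidgeCV pipeline described above: the penalty is passed in as a fixed argument instead of being selected by leave-one-out cross-validation, $R^{2}$ is pooled over folds, and the function names are hypothetical.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float):
    """Standardize features on the training split and solve ridge in dual form."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard constant features
    Z = (X - mu) / sd
    y_mean = y.mean()
    # Dual solution w = Z^T (Z Z^T + lam I)^{-1} (y - mean), cheap when n << d.
    alpha = np.linalg.solve(Z @ Z.T + lam * np.eye(len(y)), y - y_mean)
    w = Z.T @ alpha
    return lambda X_new: ((X_new - mu) / sd) @ w + y_mean


def cross_validated_r2(X, y, lam: float, n_folds: int = 5, seed: int = 0) -> float:
    """Pooled out-of-fold R^2 for a ridge probe with a fixed penalty."""
    order = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)
        pred[fold] = fit_ridge(X[train], y[train], lam)(X[fold])
    return float(1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))
```

Standardization statistics are computed inside each training fold so that held-out rows do not leak into the fit.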

Probe scores are interpreted only within matched pairs, not as cross-family absolute quantities, because architectures differ in hidden dimension, tokenization, prompt format, and layer count. Within that scope, peak generation-stage $R^{2}$ on Codeforces runs from $0.22$ (R1-Distill-Llama-8B) to $0.37$ (Phi-4-Reasoning) for reasoning models, well above the surface floor. On SAT, the eleven matched-pair models span $0.16$ (Phi-4-Reasoning) to $0.50$ (Qwen2.5-14B-Instruct); reasoning models cluster in $0.16$–$0.33$ and matched baselines in $0.36$–$0.50$, reproducing the panel (c) gap of Figure [3](https://arxiv.org/html/2605.15454#S5.F3) at the per-model level.

Figure [3](https://arxiv.org/html/2605.15454#S5.F3) reports three reasoning-minus-baseline gaps over 18 pair-domain records (6 matched pairs $\times$ 3 domains): $\Delta R^{2}_{\mathrm{prompt}}$ from peak prompt-stage $R^{2}$ across layers, $\Delta R^{2}_{\mathrm{gen}}$ from peak generation-stage $R^{2}$ across the layer $\times$ position grid, and $\Delta\rho_{\perp}^{D}$ from the length-corrected directness-difficulty coupling. All three are signed within-pair differences.

### E.2 Temporal emergence of the geometric signal

We analyze when the difficulty-geometry coupling emerges during generation by computing $\rho_{\perp}$ on progressively longer prefixes of each trace. All metrics are computed on the generated solution segment, consistent with the main trajectory analysis pipeline. The analysis covers all six reasoning-baseline pairs on Codeforces and MATH. In code, all six reasoning models show flat $\rho_{\perp}$ curves from the first 10% prefix onward (all $|\Delta|<0.05$ between 10% and 100%). In math, three of six models show building patterns: for R1-7B and R1-32B, the coupling builds from near zero to the full-trace value over the course of generation; Phi-4-Reasoning shows the most gradual trajectory, beginning negative and crossing zero after roughly two-thirds of generation.

#### Prefix directness.

For each trace, hidden states are truncated to the generated solution segment (using the boundary detected as in Appendix [B.1](https://arxiv.org/html/2605.15454#A2.SS1)) before computing prefix directness. At each prefix fraction $f\in\{0.1,0.2,\ldots,1.0\}$, we take the first $\lfloor f\cdot T_{\mathrm{reasoning}}\rfloor$ states (minimum 3 tokens), compute directness, average across runs per problem, and compute $\rho_{\perp}$ with bootstrap CIs ($n_{\mathrm{boot}}=1{,}000$).
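The per-trace part of this procedure can be sketched as below. Reading "minimum 3" as a lower clamp on the number of states in a prefix is an assumption, as is the use of integer tenths to avoid floating-point surprises in the floor; averaging across runs and the bootstrap are left out.

```python
import numpy as np


def directness(states: np.ndarray) -> float:
    """Net displacement over path length; NaN if the prefix never moves."""
    path = np.linalg.norm(np.diff(states, axis=0), axis=1).sum()
    if path == 0:
        return float("nan")
    return float(np.linalg.norm(states[-1] - states[0]) / path)


def prefix_directness(states: np.ndarray, min_states: int = 3) -> dict:
    """Directness of the first floor(f * T) states for f = 0.1, 0.2, ..., 1.0."""
    total = states.shape[0]
    curve = {}
    for tenth in range(1, 11):
        # Integer arithmetic gives floor(f * T) exactly; clamp to [min_states, total].
        k = min(max((tenth * total) // 10, min_states), total)
        curve[tenth / 10] = directness(states[:k])
    return curve
```

A flat curve across fractions would correspond to the code-domain pattern described in the text, and a rising curve to the gradual emergence seen in math.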

#### Within-trace segmentation.

We divide each trace into non-overlapping windows of 100 tokens and compute per-window directness alongside behavioral density (strategy-shifting events per window, from sentence-level LLM-judge annotations). Wilcoxon signed-rank tests comparing shift-dense versus shift-sparse windows yield significant effects ($p<0.05$) for 5 of 7 models tested in the code domain, with small effect sizes. The two null results (R1-Distill-Qwen-32B, $p=0.35$; Llama-3.1-8B-Instruct, $p=0.13$) suggest the within-trace signal is not universal. The difficulty-geometry coupling operates primarily at the whole-trace level, with modest within-trace contributions.

### E.3 Difficulty-direction interventions

#### Direction extraction.

From the probing heatmap, we identify the (layer, position) cell with the highest $R^{2}$ and refit Ridge on the averaged, length-residualized data at that cell. We extract the weight vector $\mathbf{w}$ and transform it to the original feature space:

$$\hat{\mathbf{d}}=\frac{\mathbf{w}\oslash\mathbf{s}}{\|\mathbf{w}\oslash\mathbf{s}\|_{2}},\tag{5}$$

where $\mathbf{s}$ is the vector of per-feature standard deviations and $\oslash$ denotes elementwise division.

#### Sigma calibration.

We project the training data onto $\hat{\mathbf{d}}$ and compute $\sigma_{\mathrm{proj}}=\mathrm{std}(\mathbf{X}\hat{\mathbf{d}})$. This provides a natural scale: $\alpha=1$ corresponds to a one-standard-deviation shift along the difficulty axis.

#### Steering protocol.

At each generation step, a forward hook modifies the hidden state at the target layer [Turner et al., [2023](https://arxiv.org/html/2605.15454#bib.bib38), Li et al., [2024](https://arxiv.org/html/2605.15454#bib.bib39)]:

$$\mathbf{h}_{t}^{(\ell)}\leftarrow\mathbf{h}_{t}^{(\ell)}+\alpha\cdot\sigma_{\mathrm{proj}}\cdot\hat{\mathbf{d}},\tag{6}$$

applied at the last-token position only. The perturbation magnitude is negligible relative to hidden-state norms ($\sim$1.8% at $\alpha=3.0$), producing null behavioral effects.
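Equations (5) and (6) and the sigma calibration fit together as in the following NumPy sketch. It operates on plain arrays and does not show the forward hook itself; `relative_shift` is a simple norm ratio introduced here for illustration, not the paper's perturbation-ratio definition.

```python
import numpy as np


def difficulty_direction(w: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Eq. (5): map standardized-space probe weights back to raw features, unit norm.

    A probe on z = (x - mean) / s scores w . z = (w / s) . x + const,
    so w / s is the direction in the original feature space.
    """
    raw = w / s
    return raw / np.linalg.norm(raw)


def sigma_proj(X: np.ndarray, d_hat: np.ndarray) -> float:
    """Standard deviation of the training data projected onto d_hat."""
    return float(np.std(X @ d_hat))


def steer(h: np.ndarray, d_hat: np.ndarray, sigma: float, alpha: float) -> np.ndarray:
    """Eq. (6): shift one hidden state by alpha projected standard deviations."""
    return h + alpha * sigma * d_hat


def relative_shift(h: np.ndarray, sigma: float, alpha: float) -> float:
    """Norm of the added vector relative to the hidden-state norm (illustrative proxy)."""
    return float(abs(alpha) * sigma / np.linalg.norm(h))
```

Because $\hat{\mathbf{d}}$ has unit norm, the size of the shift is exactly $|\alpha|\,\sigma_{\mathrm{proj}}$, which makes it easy to compare against typical hidden-state norms.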

#### Nullspace projection.

For each layer, we project hidden states into the nullspace of the probe weight vector and recompute downstream metrics. The relative drop in $\rho_{\perp}$ is $<0.01$ across all tested layers for R1-Distill-Qwen-7B, indicating the probe direction carries negligible variance in the activation manifold.
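Projecting onto the nullspace of a single weight vector is a rank-one operation, $H(I-\mathbf{u}\mathbf{u}^{\top})$ with $\mathbf{u}=\mathbf{w}/\|\mathbf{w}\|_{2}$. A minimal sketch, assuming hidden states are stacked row-wise:

```python
import numpy as np


def project_out(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Remove the component of each row of H along the probe direction w."""
    u = w / np.linalg.norm(w)
    # Equivalent to H @ (I - u u^T) without forming the d x d projector.
    return H - np.outer(H @ u, u)
```

After projection every state has zero inner product with `w`, so a linear probe using that direction can no longer read anything from it; trajectory metrics can then be recomputed on the projected states.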

#### INLP erasure.

Iterative nullspace projection [Ravfogel et al., [2020](https://arxiv.org/html/2605.15454#bib.bib41)] removes the top linear directions predictive of difficulty. Probe $R^{2}$ drops from $0.21$ to $0.01$ after 3 iterations (code) and from $0.29$ to $0.02$ after 6 iterations (math), confirming difficulty information is concentrated in a low-dimensional subspace. All removed directions have similarly low cosine overlap with the activation manifold.

#### Activation steering.

Three steering conditions (ridge direction, random direction, orthogonal direction) across 9 alpha values ($-3.0$ to $+3.0$) produce no significant changes in reasoning length, backtracking frequency, or correctness. The perturbation ratio (projection of the steering vector onto the occupied activation subspace, normalized by hidden-state norm) is $0.018$ at $\alpha=3.0$, explaining the null effect.

Together, these intervention checks separate linear decodability from causal control\. Difficulty is linearly accessible in some hidden\-state directions, but those probe directions do not provide high\-leverage controls over trajectory geometry or generated behavior\.

## Appendix F Behavioral Annotations

### F.1 Annotation categories

Three independent LLM judges classify each sentence of the generated solution segment\. Behavioral categories are defined as follows:

- Strategy shifting: The model explicitly abandons or replaces its current approach (e.g., “Let me try a different approach,” “Actually, this won’t work because…”).
- Uncertainty monitoring: The model expresses doubt about its current reasoning, hedges a conclusion, or flags a potential error without yet changing strategy (e.g., “I’m not sure this is right,” “Wait, let me check…”).
- Self-correction: The model identifies and corrects a specific error in its previous reasoning (e.g., “No, that’s wrong because…,” correcting a calculation).
- Verification: The model checks a result by substitution, re-derivation, or testing (e.g., “Let me verify by plugging in…”).
- Problem restatement: The model restates the problem, constraints, or goal without advancing the solution.
- Subgoal decomposition: The model breaks the problem into named subproblems or explicitly sequences steps.

### F.2 Judge protocol and aggregation

Each judge receives the full reasoning trace with sentence boundaries pre\-tokenized\. The system prompt instructs the judge to classify each sentence into exactly one category\. Majority\-vote aggregation across the three judges produces the final label\.
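Majority-vote aggregation over three judges can be sketched as follows. With six categories a three-way split is possible; returning `None` in that case is an assumption, since the text does not say how such sentences are handled.

```python
from collections import Counter


def majority_label(votes, fallback=None):
    """Label chosen by at least two of the three judges, else `fallback`."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback


def aggregate_trace(judge_labels):
    """Aggregate per-sentence labels; `judge_labels` holds one label list per judge."""
    return [majority_label(votes) for votes in zip(*judge_labels)]
```

For example, votes of `("verification", "verification", "self_correction")` resolve to `"verification"`, while three different labels resolve to `None`.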

#### Inter-judge agreement.

Mean pairwise Spearman $\rho$ on per-problem behavior rates: $\rho=0.85$ (range: $0.81$–$0.89$ across category-judge pairs). Cohen’s $\kappa$ on sentence-level labels: $0.72$ (substantial agreement). Disagreements concentrate on the self-correction / strategy-shifting boundary, where annotators differ on whether an error acknowledgment constitutes a correction or a strategy change.

#### Judges and metadata exposure.

The three judges are Gemma-2-9B-IT, Llama-3.1-8B-Instruct, and Qwen2.5-7B-Instruct. Each judge labels the same sentence-segmented traces independently. Judges receive the category definitions of Appendix [F](https://arxiv.org/html/2605.15454#A6) as their system prompt and the sentence-segmented generated solution segment as their user prompt. Model identity, item difficulty, correctness outcome, trajectory metrics, and matched-pair labels are not included in the prompt; the domain is implicit because each judge run is per-domain.

### F.3 Residualized indirect-effect estimates

This appendix reports residualized indirect-effect estimates that complement Section [6](https://arxiv.org/html/2605.15454#S6). All variables are residualized on $\log N$ and analyzed within model and domain.

#### R1-7B anchor result.

For R1-Distill-Qwen-7B in the code domain (30 runs per problem, the richest data), strategy shifting and uncertainty monitoring have residualized indirect-effect proportions of $82\%$ and $98\%$, with bootstrap 95% CIs excluding zero for both.

#### Cross-model results.

Table [10](https://arxiv.org/html/2605.15454#A6.T10) reports indirect-effect proportions for all six reasoning models on Codeforces.

Table 10: Residualized indirect-effect estimates: proportion of the difficulty-directness$_{\perp}$ co-variation that is statistically accounted for by each behavioral rate (code domain). All variables residualized on $\log N$. **Bold** = bootstrap CI excludes zero ($n_{\mathrm{boot}}=1{,}000$). $\dagger$ R1-7B uses 30 runs; others use 5. Values exceeding 100% reflect partial suppression: when the direct effect is small relative to the indirect path and the two have opposite signs, the indirect-effect proportion can exceed unity.

| Model | Strat. Shift | Uncert. Mon. | Self-Corr. | Verific. |
| --- | --- | --- | --- | --- |
| R1-Distill-Qwen-7B$\dagger$ | **82%** | **98%** | $-1\%$ | $-24\%$ |
| R1-Distill-Qwen-14B | **71%** | **91%** | $<1\%$ | $-32\%$ |
| R1-Distill-Qwen-32B | **145%** | **141%** | $-5\%$ | $-7\%$ |
| R1-Distill-Llama-8B | **135%** | **127%** | $<1\%$ | $-9\%$ |
| QwQ-32B | $28\%$ | $12\%$ | $-7\%$ | $-27\%$ |
| Phi-4-Reasoning | $30\%$ | $-6\%$ | $-3\%$ | **85%** |

![Refer to caption](https://arxiv.org/html/2605.15454v1/x10.png)

Figure 10: **Where reasoning behaviors occur.** Each row is one DeepSeek-R1-7B code trace; rows are ordered by pooled-IRT difficulty (top: easy, $b=-2.5$; bottom: hard, $b=+4.8$), two traces per difficulty quintile. The horizontal axis is the normalised character position in the full response (0: start of $\langle$think$\rangle$; 1: end of answer). Coloured + hatched spans show majority-vote sentence labels (at least two of three judges agree) for the six annotated behaviors; the grey tail to the right of each row is the post-$\langle$/think$\rangle$ answer segment, which shrinks (relatively) as difficulty rises. Self-correction and strategy shifting fire across the trace, while verification clusters near the end; harder problems show denser overlap of multiple behaviors.

The four R1-distilled models (sharing the DeepSeek-R1 teacher) show large indirect-effect proportions for both strategy shifting (4/4 with CIs excluding zero) and uncertainty monitoring (4/4). QwQ-32B and Phi-4-Reasoning, trained with different methods, show weaker indirect-effect estimates ($28\%$ and $30\%$ for strategy shifting; neither CI excludes zero with five runs). Phi-4-Reasoning instead shows a strong verification effect ($85\%$), a pattern absent in the distilled models. Estimates exceeding 100% reflect partial suppression: when the direct effect ($c^{\prime}$) is small relative to the indirect pathway ($ab$) and the two have opposite signs, the indirect-effect proportion exceeds unity.
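The suppression arithmetic can be made concrete with the textbook product-of-coefficients decomposition, in which the total effect is $c = c^{\prime} + ab$ and the indirect-effect proportion is $ab/c$. Whether the paper uses exactly this estimator is an assumption; the variable names below (`x` for residualized difficulty, `m` for a residualized behavioral rate, `y` for residualized directness) are illustrative.

```python
import numpy as np


def ols(design_columns, target) -> np.ndarray:
    """OLS coefficients with an intercept prepended to the given columns."""
    design = np.column_stack([np.ones(len(target)), *design_columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def indirect_effect_proportion(x, m, y) -> float:
    """ab / (ab + c'), with a from m ~ x and (c', b) from y ~ x + m."""
    a = ols([x], m)[1]
    _, c_prime, b = ols([x, m], y)
    indirect = a * b
    return float(indirect / (indirect + c_prime))
```

With an indirect path of $ab = 0.6$ and an opposing direct effect of $c^{\prime} = -0.1$, the total effect is $0.5$ and the proportion is $0.6/0.5 = 120\%$, which is how values above 100% arise.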

Behavioral annotations and corrected geometry are derived from the same generated traces\. These estimates therefore describe co\-variation between observable reasoning behaviors and trajectory geometry; they do not establish that the annotated behaviors cause the geometric signal\. The strongest co\-variation is with strategy shifting and uncertainty monitoring in R1\-distilled models, marking computational reorientation rather than continued elaboration along a fixed approach\.

#### Within-trace segmentation.

Within-trace segmented analysis (dividing traces into 100-token windows and testing whether shift-dense windows show different directness than shift-sparse windows) yields mixed results: 5 of 7 models tested show significant within-trace effects (Wilcoxon $p<0.05$), but the effect sizes are small relative to the between-problem signal. The two null results (R1-Distill-Qwen-32B, $p=0.35$; Llama-3.1-8B-Instruct, $p=0.13$) suggest the within-trace signal is not universal. The indirect-effect estimate is dominated by between-problem variation. Figure [10](https://arxiv.org/html/2605.15454#A6.F10) visualises where these behaviors fire across ten DeepSeek-R1-7B code traces sampled across the difficulty range.