Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness

arXiv cs.LG Papers

Summary

This paper studies the geometric properties of chain-of-thought trajectories in the hidden state space of transformers, introducing effective dimension and kinematic features to predict task hardness and solution correctness from early tokens.

arXiv:2607.01571v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to solve complex problems by generating intermediate reasoning steps. While much attention has been paid to the length and content of these reasoning chains, far less is known about their internal geometry. We study the \emph{geometry} of CoT trajectories in the hidden state space of transformer models, formalizing each reasoning chain as a discrete curve in $\mathbb{R}^d$ and characterizing it through spectral, positional, and kinematic geometric functionals. We introduce the effective dimension $d_\rho$ as a measure of trajectory complexity and show theoretically that trajectories with flatter eigenvalue spectra correspond to harder tasks, as they explore more of the hidden dimensions. Lastly, we explore how kinematic features of the trajectory, mean position, positional dispersion, initial and current hidden states, mean velocity, mean speed, and speed dispersion, can be used to predict solution correctness before generation is complete, and may inform future early-stopping strategies. Experimentally, on mathematical reasoning problems from the MATH500 dataset, $d_\rho$ achieves $0.93$ AUC in distinguishing easy from hard problems, while kinematic features potentially can predict correctness from only the first $20\%$ of generated tokens. These correctness signatures transfer across questions of varying difficulty, establishing that the shape of a model's internal reasoning trajectory is a principled window into both task hardness and solution quality.

# Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness
Source: [https://arxiv.org/html/2607.01571](https://arxiv.org/html/2607.01571)
Mahsa Bazzaz (Northeastern University), Adel Javanmard (University of Southern California; Google Research), Vahab Mirrokni (Google Research)

\*Equal contribution

###### Abstract

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to solve complex problems by generating intermediate reasoning steps. While much attention has been paid to the length and content of these reasoning chains, far less is known about their internal geometry. We study the *geometry* of CoT trajectories in the hidden state space of transformer models, formalizing each reasoning chain as a discrete curve in $\mathbb{R}^{d}$ and characterizing it through spectral, positional, and kinematic geometric functionals. We introduce the effective dimension $d_{\rho}$ as a measure of trajectory complexity and show theoretically that trajectories with flatter eigenvalue spectra correspond to harder tasks, as they explore more of the hidden dimensions. Lastly, we explore how kinematic features of the trajectory (mean position, positional dispersion, initial and current hidden states, mean velocity, mean speed, and speed dispersion) can be used to predict solution correctness before generation is complete, and may inform future early-stopping strategies. Experimentally, on mathematical reasoning problems from the MATH500 dataset, $d_{\rho}$ achieves $0.93$ AUC in distinguishing easy from hard problems, while kinematic features can potentially predict correctness from only the first $20\%$ of generated tokens. These correctness signatures transfer across questions of varying difficulty, establishing that the *shape* of a model’s internal reasoning trajectory is a principled window into both task hardness and solution quality.

## 1 Introduction

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through chain-of-thought (CoT) prompting, where models generate intermediate reasoning steps before producing final answers (Wei et al., [2022](https://arxiv.org/html/2607.01571#bib.bib36)). Recent systems such as OpenAI’s o1 and DeepSeek R1 have shown that scaling test-time compute, allowing models to “think longer”, can dramatically improve performance on complex reasoning tasks (OpenAI, [2024](https://arxiv.org/html/2607.01571#bib.bib31); Guo et al., [2025](https://arxiv.org/html/2607.01571#bib.bib44)). At the same time, it has been observed that simply increasing test-time computation can harm performance, a phenomenon known as overthinking: reasoning length does not directly convert to correct answers (Su et al., [2025](https://arxiv.org/html/2607.01571#bib.bib38)). In general, one expects a model to engage in more deliberate reasoning for harder tasks and less for easier ones. Recent theoretical work has further shown that, for transformers trained on an in-context weight prediction task for linear regression, increasing test-time compute can harm performance when the skills required to solve the downstream task are insufficiently represented in the training data (Javanmard et al., [2025](https://arxiv.org/html/2607.01571#bib.bib26)).

Despite these theoretical and empirical advances, fundamental questions remain: What makes a task hard for a general LLM? How does task difficulty affect the model’s internal representations, and can we predict it from those representations alone? Can we identify promising reasoning paths early, before generation is complete? These questions have profound practical implications: when generating $n$ candidate solutions to a problem (best-of-$n$ sampling), can we prioritize which paths to pursue based on the geometry of their early trajectories?

In this paper, we address these questions by studying the geometry of chain-of-thought trajectories in the hidden state space of transformer models. Our key insight is that as an LLM generates tokens during reasoning, the sequence of hidden states traces a discrete curve in $\mathbb{R}^{d}$, and the *geometric properties* of this curve encode information about both task difficulty and solution quality.

![Refer to caption](https://arxiv.org/html/2607.01571v1/x1.png)

Figure 1: Overview of our framework. A question is fed to an LLM whose chain-of-thought generation traces a discrete curve $\gamma=(h_{1},\ldots,h_{n})\in\mathbb{R}^{d}$ in hidden-state space. (A) Each generated token produces a hidden state, forming a trajectory whose geometry we analyze. (B) Hard problems induce higher-dimensional trajectories than easy ones: the effective dimension $d_{\rho}$ has mean $\approx 170$ for hard problems versus $\approx 122$ for easy ones, providing a geometric measure of task hardness. (C) Whether a reasoning chain will reach a correct answer is partially detectable from geometric functionals of the early trajectory: kinematic features extracted from only the first $20\%$ of generated tokens predict solution correctness with high AUC before generation is complete.

Our contributions are as follows:

- Formal Framework for CoT Geometry (Section [3](https://arxiv.org/html/2607.01571#S3)): We formalize CoT reasoning as a discrete curve in $\mathbb{R}^{d}$ and introduce geometric functionals that extract spectral, positional, and kinematic properties of reasoning trajectories.
- Effective Dimension as Task Complexity (Section [4](https://arxiv.org/html/2607.01571#S4)): We introduce a geometric functional that captures the hardness of a task, namely the effective dimension $d_{\rho}$ of reasoning curves, as a principled measure of task hardness. We further characterize which curves attain the highest effective dimension, establishing them as geometric representatives of the hardest tasks.
- Hardness Prediction (Section [5.3](https://arxiv.org/html/2607.01571#S5.SS3)): Using only effective dimension features, we achieve AUC $>0.93$ in predicting whether a mathematical problem is easy or hard.
- Correctness Prediction (Section [5.2](https://arxiv.org/html/2607.01571#S5.SS2)): Seven kinematic and positional features of the trajectory predict solution correctness with AUC $=0.806$ from only the first $20\%$ of generated tokens, with promising implications for early-exit strategies and best-of-$n$ ranking.

## 2 Related Work

Chain-of-thought prompting (Wei et al., [2022](https://arxiv.org/html/2607.01571#bib.bib36); Kojima et al., [2022](https://arxiv.org/html/2607.01571#bib.bib27)) has emerged as a powerful technique for eliciting multi-step reasoning in LLMs. Recent work has explored scaling test-time compute (Snell et al., [2024](https://arxiv.org/html/2607.01571#bib.bib34); Welleck et al., [2024](https://arxiv.org/html/2607.01571#bib.bib37); Muennighoff et al., [2025](https://arxiv.org/html/2607.01571#bib.bib39)), with systems like OpenAI o1 (OpenAI, [2024](https://arxiv.org/html/2607.01571#bib.bib31)) and DeepSeek R1 (Guo et al., [2025](https://arxiv.org/html/2607.01571#bib.bib44)) demonstrating strong performance through extended reasoning chains. A complementary line of work has observed that more reasoning is not always better: overthinking can degrade performance when the skills required for a task are underrepresented in training (Su et al., [2025](https://arxiv.org/html/2607.01571#bib.bib38)). Our work studies these phenomena from a geometric angle, asking not how long a chain is but what shape it traces in hidden state space.

Javanmard et al. ([2025](https://arxiv.org/html/2607.01571#bib.bib26)) provide a theoretical analysis of test-time scaling for transformers trained on in-context weight prediction for linear regression. They characterize task hardness via the ratio of the trace to the minimum eigenvalue of the feature covariance matrix, showing that harder tasks require longer chains-of-thought to reach a given error level, and that insufficient task coverage in training can cause additional reasoning steps to hurt performance. Our work is complementary but distinct in two ways. First, we study task hardness empirically in a general LLM rather than deriving it from a tractable linear model. Second, and more fundamentally, we shift the unit of analysis from the output chain to the internal hidden state trajectory: we show that task hardness leaves a geometric signature in the model’s representation space, captured by the effective dimension $d_{\rho}$ of the trajectory covariance, and that this quantity alone is highly predictive of problem difficulty.

Korbak et al. ([2025](https://arxiv.org/html/2607.01571#bib.bib40)) argue that chain-of-thought reasoning offers a unique safety opportunity because, for sufficiently hard tasks, transformers must externalize reasoning through the CoT in order to complete it, making that reasoning in principle observable. They focus on the content of the generated text as the monitoring signal and discuss conditions under which this signal may degrade. Our work operates at a different level: rather than reading the textual content of the chain, we read the *geometry* of the hidden states that produce it. The two perspectives are complementary: CoT text monitoring and hidden-state trajectory analysis can in principle be combined, but our approach is model-internal and does not rely on the model producing legible natural language reasoning.

Sun et al. ([2026](https://arxiv.org/html/2607.01571#bib.bib41)) study LLM reasoning as a structured trajectory in representation space, extracting hidden states at explicit step boundaries (“Step 1:”, “Step 2:”, …) and showing that these activations form linearly separable, step-specific subspaces that become more pronounced with layer depth. For correctness prediction, they achieve high AUC using late-step trajectory features, and explore inference-time interventions such as activation steering (Turner et al., [2023](https://arxiv.org/html/2607.01571#bib.bib46)) to correct deviating trajectories. Our work shares the trajectory perspective but pursues different goals. Rather than analyzing step-boundary activations, we treat the full token-level hidden state sequence as a continuous curve and characterize it through spectral and kinematic geometric functionals. This allows us to ask whether trajectory geometry encodes task difficulty. We show that the effective dimension $d_{\rho}$ of the trajectory covariance, a spectral property of the curve as a whole, predicts whether a problem is easy or hard with high AUC, and we provide a theoretical account of why harder tasks necessarily induce higher-dimensional trajectories. We further show that kinematic features of the trajectory carry an early correctness signal that is detectable from only the first $20$ percent of generated tokens, opening a practical route to early stopping and best-of-$n$ ranking without waiting for generation to complete.

Recent work has also proposed geometric frameworks for understanding how LLMs reason. Zhou et al. ([2025](https://arxiv.org/html/2607.01571#bib.bib45)) model reasoning as smooth flows in representation space, using the velocity and Menger curvature of the trajectory to show that logical structure, rather than surface semantics, governs the direction and magnitude of these flows. Their focus is on interpretability. Our work takes a complementary direction: we use geometric functionals of the hidden-state trajectory, specifically the spectral effective dimension and kinematic summaries, to predict task difficulty and solution correctness, connecting trajectory geometry directly to downstream performance.

Lastly, Prasad et al. ([2026](https://arxiv.org/html/2607.01571#bib.bib42)) show that effective reasoning strategies reduce the intrinsic dimensionality of the learning objective, measured as the minimum number of LoRA parameters needed to fine-tune a model to a given accuracy threshold on GSM8K. They fix the model and vary the reasoning strategy, finding that lower intrinsic dimensionality correlates strongly with better generalization. While both their work and ours use notions of dimensionality to characterize reasoning, the two measures are conceptually distinct. Their intrinsic dimension is a property of the *learning problem* induced by a reasoning strategy: it requires fine-tuning experiments and measures how compressible a dataset of reasoning chains is. Our effective dimension $d_{\rho}$ is a property of a *single inference trajectory*: it is computed from the covariance of hidden states produced during one forward pass and requires no training. This makes our measure applicable at inference time and enables per-instance predictions of task difficulty and solution correctness.

## 3 Problem Formulation

In this section we formalize chain-of-thought reasoning and develop a mathematical framework for characterizing its dynamics via geometric functionals. Consider a transformer language model with $L$ layers and hidden dimension $d$. Let $\mathcal{V}$ denote the finite vocabulary, and let $\mathcal{V}^{*}:=\bigcup_{n=0}^{\infty}\mathcal{V}^{n}$ denote the set of all finite sequences over $\mathcal{V}$ (the Kleene star of $\mathcal{V}$). Let $\Delta(\mathcal{V})$ denote the simplex of probability measures over $\mathcal{V}$. The model defines a map from finite token sequences to probability measures over the next token:

$$\mu:\mathcal{V}^{*}\to\Delta(\mathcal{V}),\quad(x_{1},\ldots,x_{t})\mapsto\mu(\cdot\mid x_{1},\ldots,x_{t}). \tag{1}$$

For each layer $\ell\in\{1,\ldots,L\}$, the model also produces a hidden state representation in $\mathbb{R}^{d}$, where $d$ is the dimension of the latent representation:

$$f^{(\ell)}:\mathcal{V}^{*}\to\mathbb{R}^{d},\quad(x_{1},\ldots,x_{t})\mapsto H_{t}^{(\ell)}\in\mathbb{R}^{d}. \tag{2}$$

When the layer $\ell$ is fixed or clear from context, we write $H_{t}\equiv H_{t}^{(\ell)}$. Given the distribution $\mu(\cdot\mid x_{1},\ldots,x_{t})\in\Delta(\mathcal{V})$, the next token is selected according to a temperature parameter $T\geq 0$. In particular, at temperature $T>0$, we sample from a tempered distribution:

$$X_{t+1}\sim\mu_{T}(\cdot\mid x_{1},\ldots,x_{t}),\quad\text{where }\mu_{T}(x)\propto\mu(x)^{1/T}. \tag{3}$$

At temperature $T=0$, the distribution concentrates on the mode:

$$X_{t+1}=\arg\max_{x\in\mathcal{V}}\mu(x\mid x_{1},\ldots,x_{t}). \tag{4}$$

This distinction is fundamental: at $T=0$, given a prompt, the generated sequence is unique; at $T>0$, the same prompt yields a distribution over sequences.
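The two decoding regimes in (3) and (4) can be illustrated with a minimal sketch. The function names `tempered_distribution` and `sample_next_token` are hypothetical and not taken from the paper; the sketch assumes the next-token distribution is given as a plain list of probabilities.

```python
import math
import random

def tempered_distribution(probs, temperature):
    """Return mu_T with mu_T(x) proportional to mu(x)^(1/T); T = 0 puts all mass on the mode."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy limit, as in equation (4); ties go to the first maximal index.
        mode = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == mode else 0.0 for i in range(len(probs))]
    # Work in log space for numerical stability; zero-probability tokens stay at zero.
    logs = [math.log(p) / temperature if p > 0 else -math.inf for p in probs]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

def sample_next_token(probs, temperature, rng=random):
    """Draw a token index from the tempered distribution, as in equation (3)."""
    mu_t = tempered_distribution(probs, temperature)
    return rng.choices(range(len(probs)), weights=mu_t, k=1)[0]
```

At $T=1$ the tempered distribution coincides with $\mu$, while smaller temperatures sharpen it toward the mode.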

### 3.1 The Space of CoT Curves

At $T=0$, token selection is deterministic: given a prompt, there is exactly one generated sequence of tokens. This motivates the following definition of the space of discrete curves.

###### Definition 1\.

Fix a maximum sequence length $n\in\mathbb{N}$. The space of discrete curves of length $n$ is $\mathcal{C}_{n}:=(\mathbb{R}^{d})^{n}$. An element $\gamma\in\mathcal{C}_{n}$ is a tuple $\gamma=(h_{1},\ldots,h_{n})$ where $h_{i}\in\mathbb{R}^{d}$.

We have a natural embedding $\mathcal{C}_{m}\subseteq\mathcal{C}_{n}$ for $m\leq n$, obtained by repeating the last element $n-m$ times (in practice, we do not apply this padding but instead work directly with variable-length trajectories). Let $\mathcal{P}\subset\mathcal{V}^{*}$ denote the space of input prompts. At $T=0$, the model defines a deterministic map from the set of prompts to the space of curves. In particular, we have:

$$\mathcal{H}:\mathcal{P}\to\mathcal{C}_{n},\quad P\mapsto(h_{1},\ldots,h_{m},h_{m},\ldots,h_{m}), \tag{5}$$

where $m$ is the generation length and the final state $h_{m}$ is repeated to fill length $n$ (stationary extension). We thus regard the length-$n$ chain of thought as an element of the space of discrete curves of length $n$. In other words, given an input $\omega\in\mathcal{V}^{*}$, the length-$n$ chain-of-thought curve is the element $\mathcal{H}(\omega)=(h_{1}(\omega),\ldots,h_{n}(\omega))\in\mathcal{C}_{n}$ produced by generation from $\omega$, with stationary extension if necessary. We characterize CoT curves through real- or vector-valued functionals, in particular:

###### Definition 2\.

A vector-valued geometric functional is a function $\varphi:\mathcal{C}_{n}\to\mathbb{R}^{k}$ that extracts geometric properties of curves. The composition $\varphi\circ\mathcal{H}:\mathcal{V}^{*}\to\mathbb{R}^{k}$ characterizes how these properties vary across prompts.

As an example, for a curve $\gamma=(h_{1},\ldots,h_{n})\in\mathcal{C}_{n}$, define the centered curve with elements $\bar{h}_{t}=h_{t}-\frac{1}{n}\sum_{i=1}^{n}h_{i}$ and its trajectory covariance matrix

$$C(\gamma):=\frac{1}{n}\sum_{t=1}^{n}\bar{h}_{t}\bar{h}_{t}^{\top}=\frac{1}{n}\bar{H}^{\top}\bar{H}\in\mathbb{R}^{d\times d},$$

where $\bar{H}=[\bar{h}_{1},\ldots,\bar{h}_{n}]^{\top}\in\mathbb{R}^{n\times d}$. As such, $C$ is a geometric functional that takes a curve and produces an element of $\mathbb{R}^{d^{2}}$.

Let $\nu_{1}\geq\nu_{2}\geq\cdots\geq\nu_{d}\geq 0$ be the eigenvalues of $C(\gamma)$ in non-increasing order. We then introduce another important geometric functional as follows:

###### Definition 3\.

For $\rho\in(0,1]$, the effective dimension at threshold $\rho$ is:

$$d_{\rho}(\gamma):=\min\left\{k\in\{1,\ldots,d\}:\frac{\sum_{i=1}^{k}\nu_{i}}{\mathrm{tr}(C(\gamma))}\geq\rho\right\}. \tag{6}$$

This is the minimum number of principal components needed to capture at least a $\rho$ fraction of the total variance.

The effective dimension measures the *intrinsic dimensionality* of the reasoning trajectory. A low $d_{\rho}$ indicates the trajectory lies near a low-dimensional subspace (simple, structured reasoning), while a high $d_{\rho}$ indicates the trajectory explores many directions (complex, multi-faceted reasoning).
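The trajectory covariance and Definition 3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the trajectory is assumed to be an array of shape `(n, d)` with one hidden state per row, and a small tolerance is added to absorb floating-point round-off.

```python
import numpy as np

def trajectory_covariance(H):
    """C(gamma) = (1/n) * Hbar^T Hbar for an (n, d) array of hidden states."""
    H = np.asarray(H, dtype=float)
    H_centered = H - H.mean(axis=0, keepdims=True)
    return H_centered.T @ H_centered / H.shape[0]

def effective_dimension(C, rho):
    """Smallest k whose top-k eigenvalues capture at least a rho fraction of tr(C)."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    eigvals = np.linalg.eigvalsh(C)[::-1]      # eigvalsh is ascending; reverse it
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    total = eigvals.sum()
    if total <= 0:
        raise ValueError("covariance has zero trace")
    cumulative = np.cumsum(eigvals) / total
    # First index where the cumulative fraction reaches rho (tolerance is an assumption).
    k = int(np.searchsorted(cumulative, rho - 1e-12, side="left")) + 1
    return min(k, len(eigvals))
```

A trajectory confined to a line gives $d_{\rho}=1$ for any threshold, whereas an isotropic covariance in $d$ dimensions gives $\lceil\rho d\rceil$.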

Lastly, we introduce seven additional geometric functionals, which we use in Section [5.2](https://arxiv.org/html/2607.01571#S5.SS2) for correctness prediction. Let $\gamma=(h_{1},\ldots,h_{n})\in\mathcal{C}_{n}$ be a trajectory, and let $P\in\mathbb{R}^{d\times k}$ denote the top-$k$ PCA basis fitted on the training set. Define the projected trajectory with elements $\tilde{h}_{t}=P^{\top}h_{t}\in\mathbb{R}^{k}$ for $t=1,\ldots,n$, and let $\tilde{\gamma}=(\tilde{h}_{1},\ldots,\tilde{h}_{n})$ be the projected curve. To prevent data leakage, when a fraction $\alpha\in(0,1]$ of the trajectory is observed, we restrict to the window $\tilde{\gamma}^{(\alpha)}=(\tilde{h}_{1},\ldots,\tilde{h}_{\lfloor\alpha n\rfloor})$. Our framework is more general, and many other meaningful functionals can be extracted as needed; here we define the functional $\varphi:\mathcal{C}_{n}\to\mathbb{R}^{5k+2}$ below.

Examples *(Kinematic and Positional Geometric Functionals).* Given a projected trajectory $\tilde{\gamma}=(\tilde{h}_{1},\ldots,\tilde{h}_{m})$ for some $m\leq n$, define the velocity increments $\Delta_{t}:=\tilde{h}_{t+1}-\tilde{h}_{t}\in\mathbb{R}^{k}$ for all $1\leq t\leq m-1$. The seven geometric functionals are:

1. Mean position, defined as $\mu(\tilde{\gamma})=\frac{1}{m}\sum_{t=1}^{m}\tilde{h}_{t}\in\mathbb{R}^{k}$.
2. Positional dispersion, the coordinate-wise standard deviation: $$\sigma(\tilde{\gamma})=\left(\frac{1}{m}\sum_{t=1}^{m}\bigl(\tilde{h}_{t}-\mu(\tilde{\gamma})\bigr)^{\odot 2}\right)^{\odot 1/2}\in\mathbb{R}^{k}, \tag{7}$$ where $\odot$ denotes elementwise operations.
3. Initial hidden state, the first token representation of the projected trajectory, i.e., $\tilde{h}_{1}\in\mathbb{R}^{k}$.
4. Final hidden state, the last token representation of the projected trajectory, i.e., $\tilde{h}_{m}\in\mathbb{R}^{k}$.
5. Mean velocity, the average of successive differences: $$\bar{v}(\tilde{\gamma})=\frac{1}{m-1}\sum_{t=1}^{m-1}\Delta_{t}\in\mathbb{R}^{k}. \tag{8}$$
6. Mean speed, the average step-wise Euclidean norm: $$\bar{s}(\tilde{\gamma})=\frac{1}{m-1}\sum_{t=1}^{m-1}\|\Delta_{t}\|_{2}\in\mathbb{R}. \tag{9}$$
7. Speed dispersion, the standard deviation of step-wise speeds: $$\sigma_{s}(\tilde{\gamma})=\left(\frac{1}{m-1}\sum_{t=1}^{m-1}\bigl(\|\Delta_{t}\|_{2}-\bar{s}(\tilde{\gamma})\bigr)^{2}\right)^{1/2}\in\mathbb{R}. \tag{10}$$

The full feature vector is the concatenation

$$\varphi(\tilde{\gamma})=\Bigl(\mu,\;\sigma,\;\tilde{h}_{1},\;\tilde{h}_{m},\;\bar{v},\;\bar{s},\;\sigma_{s}\Bigr)\in\mathbb{R}^{5k+2}. \tag{11}$$

Note that mean velocity telescopes to $\frac{1}{m-1}(\tilde{h}_{m}-\tilde{h}_{1})$, making it a linear function of the already-included initial and final states. We retain it for completeness and its natural connection to mean speed and speed dispersion.
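The seven functionals and their concatenation (11) can be sketched as follows. This is an illustrative sketch, not the authors' code: the name `kinematic_features` is hypothetical, and the input is assumed to be an already projected trajectory of shape `(n, k)`, so the PCA step is left out.

```python
import numpy as np

def kinematic_features(H_proj, alpha=1.0):
    """Concatenate the seven functionals over the first floor(alpha * n) projected states."""
    H_proj = np.asarray(H_proj, dtype=float)
    n, k = H_proj.shape
    m = int(np.floor(alpha * n))
    if m < 2:
        raise ValueError("need at least two observed states to form velocities")
    window = H_proj[:m]
    deltas = np.diff(window, axis=0)          # Delta_t = h_{t+1} - h_t, shape (m-1, k)
    speeds = np.linalg.norm(deltas, axis=1)   # step-wise Euclidean norms
    return np.concatenate([
        window.mean(axis=0),                  # mean position
        window.std(axis=0),                   # positional dispersion (1/m normalisation)
        window[0],                            # initial hidden state
        window[-1],                           # final observed hidden state
        deltas.mean(axis=0),                  # mean velocity
        [speeds.mean()],                      # mean speed
        [speeds.std()],                       # speed dispersion (1/(m-1) normalisation)
    ])
```

The returned vector has length $5k+2$, and its mean-velocity block equals the difference of the final and initial blocks divided by $m-1$, which matches the telescoping identity noted above.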

At temperature $T>0$, the same prompt can yield a different curve on each generation. Our formulation extends to this setting:

$$\mathcal{H}:\mathcal{V}^{*}\to\Delta(\mathcal{C}_{n}), \tag{12}$$

where $\Delta(\mathcal{C}_{n})$ denotes distributions over curves. The induced distribution arises from the autoregressive measure:

$$\mathbb{P}_{\mu_{T}}(v_{1},\ldots,v_{n})=\prod_{t=1}^{n}\mu_{T}(v_{t}\mid x_{1},\ldots,x_{k},v_{1},\ldots,v_{t-1}), \tag{13}$$

where $(x_{1},\ldots,x_{k})$ denotes the prompt tokens and $(v_{1},\ldots,v_{n})$ the generated tokens.

## 4 Effective Dimension as Task Complexity

This section develops the theoretical core of the paper. We prove general spectral bounds for any covariance matrix, show via a majorization argument that flat spectra maximize effective dimension (§[4.1](https://arxiv.org/html/2607.01571#S4.SS1)), and provide a finite-sample stability guarantee. All proofs are in Appendix [A](https://arxiv.org/html/2607.01571#A1).

### 4.1 Spectral Bounds and the Role of Flatness

The effective dimension of any PSD matrix is controlled by its eigenvalue spread\. These are purely linear\-algebraic facts, independent of any dynamical model\.

###### Proposition 1\(Spectral Bounds\)\.

For any PSD matrix $C$ with eigenvalues $\nu_{1}\geq\cdots\geq\nu_{d}\geq 0$, $\nu_{1}>0$:

$$\lceil\rho\,\mathrm{tr}(C)/\nu_{1}\rceil\;\leq\;d_{\rho}(C)\;\leq\;\lceil\rho\,\mathrm{tr}(C)/\nu_{d}\rceil, \tag{14}$$

the upper bound requiring $\nu_{d}>0$.

###### Proof sketch\.

Let $r:=d_{\rho}(C)$. The lower bound follows from $\rho\,\mathrm{tr}(C)\leq\sum_{j\leq r}\nu_{j}\leq r\nu_{1}$, giving $r\geq\lceil\rho\,\mathrm{tr}(C)/\nu_{1}\rceil$. The upper bound follows from $\sum_{j\leq r-1}\nu_{j}<\rho\,\mathrm{tr}(C)$ and $\nu_{j}\geq\nu_{d}$, giving $(r-1)\nu_{d}<\rho\,\mathrm{tr}(C)$, hence $r\leq\lceil\rho\,\mathrm{tr}(C)/\nu_{d}\rceil$. The complete argument is in Appendix [A](https://arxiv.org/html/2607.01571#A1). ∎

The ratio $\mathrm{tr}(C)/\nu_{d}$ is similar to the hardness measure of Javanmard et al. ([2025](https://arxiv.org/html/2607.01571#bib.bib26)), who use the covariance of the data rather than of the trajectory dynamics; the upper bound shows that it controls $d_{\rho}$, but *loosely*. When the spectrum is flat, the bounds coincide at $\lceil\rho d\rceil$. Flatness is extremal in a stronger sense, captured by majorization.
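The bounds in (14) can be checked on small spectra with a short sketch. The function names and the example spectra are illustrative assumptions, and exact rational arithmetic (`fractions.Fraction`) is used so that the ceilings are not affected by floating-point round-off.

```python
import math
from fractions import Fraction

def effective_dimension_from_spectrum(eigvals, rho):
    """Smallest k such that the top-k eigenvalues hold at least a rho fraction of the trace."""
    nu = sorted(eigvals, reverse=True)
    total = sum(nu)
    running = 0
    for k, value in enumerate(nu, start=1):
        running += value
        if running >= rho * total:
            return k
    return len(nu)

def spectral_bounds(eigvals, rho):
    """Lower and upper bounds of Proposition 1; the upper bound is None when nu_d = 0."""
    nu = sorted(eigvals, reverse=True)
    total = sum(nu)
    lower = math.ceil(rho * total / nu[0])
    upper = math.ceil(rho * total / nu[-1]) if nu[-1] > 0 else None
    return lower, upper

rho = Fraction(4, 5)                          # illustrative threshold
peaked = [Fraction(v) for v in (5, 3, 1, 1)]  # illustrative peaked spectrum, trace 10
flat = [Fraction(5, 2)] * 4                   # flat spectrum with the same trace

print(effective_dimension_from_spectrum(peaked, rho), spectral_bounds(peaked, rho))
print(effective_dimension_from_spectrum(flat, rho), spectral_bounds(flat, rho))
```

For the peaked spectrum the upper bound is loose, while for the flat spectrum both bounds and the effective dimension equal $\lceil\rho d\rceil$.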

###### Definition 4\.

For $x,y\in\mathbb{R}^{d}$, we say $x$ is majorized by $y$, written $x\prec y$, if

$$\sum_{i=1}^{k}x_{[i]}\;\leq\;\sum_{i=1}^{k}y_{[i]}\quad\text{for all }k=1,\ldots,d-1,\quad\text{and}\quad\sum_{i=1}^{d}x_{[i]}=\sum_{i=1}^{d}y_{[i]}, \tag{15}$$

where $x_{[1]}\geq\cdots\geq x_{[d]}$ is the decreasing rearrangement of $x$. Informally: $x$’s mass is more “spread out” than $y$’s, but both have the same total (Marshall et al., [2011](https://arxiv.org/html/2607.01571#bib.bib47)).

Majorization gives us a precise way to compare how “peaked” two spectra are. We use it to show that the flat spectrum $\nu^{\star}$ is the most spread-out among all spectra with the same trace, and achieves the highest effective dimension. More formally, we have:

###### Proposition 2\(Flat Spectrum Maximizes Effective Dimension\)\.

Let $C$ be PSD with eigenvalues $\nu\in\mathbb{R}^{d}$ and trace $\mathrm{tr}(C)$. Let $\nu^{\star}:=(\mathrm{tr}(C)/d,\ldots,\mathrm{tr}(C)/d)$ be the flat spectrum with the same total, and let $C^{\star}$ denote a PSD matrix with spectrum $\nu^{\star}$. Then:

1. $\nu^{\star}\prec\nu$: the flat spectrum is majorized by any other spectrum with the same total.
2. $d_{\rho}(C)\leq d_{\rho}(C^{\star})=\lceil\rho d\rceil$.

###### Proof sketch\.

We first show $\nu^{\star}\prec\nu$: since both vectors have total $\mathrm{tr}(C)$, this reduces to $\sum_{i\leq k}\nu_{i}\geq k\,\mathrm{tr}(C)/d$, i.e., the top-$k$ average is at least the overall mean. If not, then $\nu_{k}$ (and hence every $\nu_{j}$ with $j>k$, by decreasing order) is also below $\mathrm{tr}(C)/d$, so $\mathrm{tr}(C)=\sum_{i}\nu_{i}<\mathrm{tr}(C)$, a contradiction. The second claim follows from the definition of $d_{\rho}$ together with part 1: the partial sums of $\nu$ dominate those of $\nu^{\star}$, so the cumulative variance fraction of $\nu$ reaches $\rho$ no later than that of $\nu^{\star}$. The complete argument is in Appendix [A](https://arxiv.org/html/2607.01571#A1). ∎
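Definition 4 and part 1 of Proposition 2 can be checked numerically with a minimal sketch. The helper names `is_majorized_by` and `flat_spectrum` and the tolerance are assumptions made for illustration.

```python
def is_majorized_by(x, y, tol=1e-12):
    """True if x is majorized by y: sorted partial sums of x never exceed those of y, totals equal."""
    xs = sorted(x, reverse=True)
    ys = sorted(y, reverse=True)
    if len(xs) != len(ys) or abs(sum(xs) - sum(ys)) > tol:
        return False
    partial_x = partial_y = 0.0
    # Check the inequality for k = 1, ..., d-1; k = d is the equal-totals condition.
    for a, b in zip(xs[:-1], ys[:-1]):
        partial_x += a
        partial_y += b
        if partial_x > partial_y + tol:
            return False
    return True

def flat_spectrum(eigvals):
    """Flat spectrum with the same total as eigvals."""
    d = len(eigvals)
    return [sum(eigvals) / d] * d
```

For example, with the illustrative spectrum `[5.0, 3.0, 1.0, 1.0]`, the flat spectrum `[2.5, 2.5, 2.5, 2.5]` is majorized by it, while the converse does not hold.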

### 4.2 Stability under Small Perturbation

The following framework-independent and purely linear-algebraic result shows that if two covariance matrices are $\epsilon$-close in operator norm, where $\|\widehat{C}-C\|_{\mathrm{op}}:=\max_{\|v\|=1}\|(\widehat{C}-C)v\|$, then their effective dimensions are also close.

###### Theorem 3\.

Let $C,\widehat{C}$ be PSD with $\|\widehat{C}-C\|_{\mathrm{op}}\leq\epsilon$ and $\mathrm{tr}(C)>0$. Assume $d\epsilon\leq\mathrm{tr}(C)/2$. Define $F(j):=\sum_{i\leq j}\nu_{i}/\mathrm{tr}(C)$, where $\nu_{i}$ are the decreasingly sorted eigenvalues of $C$. Then

$$\bigl|d_{\rho}(\widehat{C})-d_{\rho}(C)\bigr|\;\leq\;\#\left\{j\in\{1,\ldots,d\}:|F(j)-\rho|\leq\tfrac{4d\epsilon}{\mathrm{tr}(C)}\right\}. \tag{16}$$

In particular, if the cumulative mass function $F$ crosses level $\rho$ transversally (i.e., $F$ has no index within distance $4d\epsilon/\mathrm{tr}(C)$ of $\rho$), then $d_{\rho}(\widehat{C})=d_{\rho}(C)$.

###### Proof sketch\.

The argument proceeds in three steps\. First, recall*Weyl’s inequality*: for symmetric matricesA,B∈ℝd×dA,B\\in\\mathbb\{R\}^\{d\\times d\}with eigenvalues sorted in decreasing order,\|λk​\(A\+B\)−λk​\(A\)\|≤‖B‖op\|\\lambda\_\{k\}\(A\+B\)\-\\lambda\_\{k\}\(A\)\|\\leq\\\|B\\\|\_\{\\mathrm\{op\}\}for everykk\. Applied withA=CA=CandB=C^−CB=\\widehat\{C\}\-C, this gives\|ν^k−νk\|≤ϵ\|\\hat\{\\nu\}\_\{k\}\-\\nu\_\{k\}\|\\leq\\epsilonfor allkk: each eigenvalue ofC^\\widehat\{C\}is withinϵ\\epsilonof the corresponding eigenvalue ofCC\. Summing acrosskk, we also get\|tr​\(C^\)−tr​\(C\)\|≤d​ϵ\|\\mathrm\{tr\}\(\\widehat\{C\}\)\-\\mathrm\{tr\}\(C\)\|\\leq d\\epsilon, and similarly the partial sumsSj:=∑i≤jνiS\_\{j\}:=\\sum\_\{i\\leq j\}\\nu\_\{i\}andS^j\\widehat\{S\}\_\{j\}differ by at mostd​ϵd\\epsilon\.

Second, we propagate this to the cumulative\-mass functionF​\(j\):=Sj/tr​\(C\)F\(j\):=S\_\{j\}/\\mathrm\{tr\}\(C\)\. Using the algebraic identityS^j​tr​\(C\)−Sj​tr​\(C^\)=\(S^j−Sj\)​tr​\(C\)\+Sj​\(tr​\(C\)−tr​\(C^\)\)\\widehat\{S\}\_\{j\}\\mathrm\{tr\}\(C\)\-S\_\{j\}\\mathrm\{tr\}\(\\widehat\{C\}\)=\(\\widehat\{S\}\_\{j\}\-S\_\{j\}\)\\mathrm\{tr\}\(C\)\+S\_\{j\}\(\\mathrm\{tr\}\(C\)\-\\mathrm\{tr\}\(\\widehat\{C\}\)\)and bounding each piece via the triangle inequality, we obtain\|F^\(j\)−F\(j\)\|≤4dϵ/tr\(C\)=:δ\|\\widehat\{F\}\(j\)\-F\(j\)\|\\leq 4d\\epsilon/\\mathrm\{tr\}\(C\)=:\\delta\.

Finally, since $d_{\rho}$ is the first index at which $F$ reaches $\rho$, and $|F(j)-\widehat{F}(j)|\leq\delta$ everywhere, the two effective dimensions can only disagree at indices where $F(j)$ lies within $\delta$ of $\rho$: away from this band, $F$ and $\widehat{F}$ agree on whether the threshold has been crossed. Counting such indices yields ([16](https://arxiv.org/html/2607.01571#S4.E16)). In particular, if $F$ jumps past $\rho$ transversally at a single index, with no $j$ satisfying $|F(j)-\rho|\leq\delta$, then $d_{\rho}(\widehat{C})=d_{\rho}(C)$ exactly. The complete argument is in Appendix [A](https://arxiv.org/html/2607.01571#A1). ∎
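The stability bound can be checked numerically on a toy example. The sketch below is illustrative only: the function names, the synthetic matrix, and the tolerance `tol` are assumptions made for this example and are not taken from the paper's code. It computes $d_\rho$ from the eigenvalues of a PSD matrix, perturbs the matrix by a symmetric matrix of operator norm $\epsilon$, and compares the change in $d_\rho$ with the band count on the right-hand side of (16).

```python
import numpy as np

def cumulative_mass(C):
    """F(j) for j = 1..d: share of tr(C) carried by the top-j eigenvalues."""
    nu = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues in decreasing order
    return np.cumsum(nu) / nu.sum()

def effective_dimension(C, rho, tol=1e-12):
    """Smallest j with F(j) >= rho (tol absorbs floating-point rounding)."""
    F = cumulative_mass(C)
    return int(np.argmax(F >= rho - tol)) + 1

def band_count(C, rho, eps):
    """Right-hand side of (16): number of j with |F(j) - rho| <= 4*d*eps/tr(C)."""
    d = C.shape[0]
    delta = 4 * d * eps / np.trace(C)
    return int(np.sum(np.abs(cumulative_mass(C) - rho) <= delta))

# Toy check on a synthetic PSD matrix (all values here are illustrative).
rng = np.random.default_rng(0)
d, eps, rho = 20, 0.01, 0.95
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # random orthogonal basis
C = Q @ np.diag(np.linspace(2.0, 0.1, d)) @ Q.T    # eigenvalues from 2.0 down to 0.1
E = rng.normal(size=(d, d))
E = (E + E.T) / 2                                  # symmetric perturbation
E *= eps / np.linalg.norm(E, 2)                    # rescale to operator norm eps
C_hat = C + E                                      # stays PSD: smallest eigenvalue 0.1 > eps

gap = abs(effective_dimension(C_hat, rho) - effective_dimension(C, rho))
print("gap:", gap, "band count:", band_count(C, rho, eps))
```

The synthetic spectrum is chosen so that $d\epsilon = 0.2 \leq \mathrm{tr}(C)/2 = 10.5$, which matches the assumption of the theorem.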

## 5Experiments

### 5\.1Experimental Setup

#### Model\.

We use Qwen2.5-0.5B-Instruct (Yang et al., [2024](https://arxiv.org/html/2607.01571#bib.bib32)), a decoder-only transformer with 24 layers and hidden dimension 896. Despite its small size, it produces well-structured reasoning chains and allows us to extract hidden states at every layer and token position without prohibitive memory cost.

#### Dataset\.

We focus on three categories from the MATH500 dataset (Hendrycks et al., [2021](https://arxiv.org/html/2607.01571#bib.bib25)): Algebra, Counting & Probability, and Precalculus. Problems in the MATH500 dataset are labeled with difficulty annotations ranging from $1$ to $5$. We consider the problems with annotations of $1$, $3$, and $5$, which we label as easy, medium, and hard, respectively.

We use a fixed set of $9$ questions per category, drawn to ensure a balanced difficulty split (the exception is Counting & Probability, which has $2$ easy questions rather than $3$). When comparing effective dimension across task difficulty, we only use questions labeled as easy or hard (i.e., annotated as $1$ or $5$).

#### Trajectory Collection\.

For each problem, we generate 10 reasoning trajectories at temperature $T=0.7$ using two chain-of-thought prompting styles (medium and long, both provided in the appendix), pooled for analysis. Each trajectory is generated autoregressively with a maximum of 800 tokens. We extract the hidden state $h_{t}^{(\ell)}\in\mathbb{R}^{896}$ at every generated token position $t$ for all layers $\ell\in\{0,\ldots,23\}$, yielding a trajectory matrix $H^{(\ell)}$ per layer per run. Correctness is determined by a symbolic answer checker combining SymPy expression matching and string normalization against the ground-truth boxed answer.

#### Features for Correctness Prediction\.

For each trajectory, we project $H^{(\ell)}$ onto the top 15 PCA components fitted on the training set, then extract seven kinematic and positional features from the windowed portion of the projected trajectory (defined in Section [3.1](https://arxiv.org/html/2607.01571#S3.SS1)): mean position, positional dispersion (standard deviation), initial hidden state, final hidden state of the truncated trajectory, mean velocity (mean of successive differences), mean speed (mean of step-wise norms), and speed dispersion. Features are standardized before classification.
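The seven features listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `kinematic_features`, the choice of a prefix window controlled by `frac`, and the use of per-coordinate statistics for the position features are all assumptions made for this example (the precise definitions are in Section 3.1).

```python
import numpy as np

def kinematic_features(Z, frac=0.2):
    """Feature vector from the first `frac` of a projected trajectory.

    Z has shape (T, k): T token positions, k PCA components.
    The prefix window and per-coordinate statistics are assumptions of this sketch.
    """
    n = max(2, int(np.ceil(frac * len(Z))))   # at least 2 points, so a velocity exists
    P = Z[:n]                                 # observed prefix of the trajectory
    V = np.diff(P, axis=0)                    # successive differences (velocities)
    speed = np.linalg.norm(V, axis=1)         # step-wise norms
    return np.concatenate([
        P.mean(axis=0),      # mean position
        P.std(axis=0),       # positional dispersion
        P[0],                # initial state
        P[-1],               # final state of the truncated trajectory
        V.mean(axis=0),      # mean velocity
        [speed.mean()],      # mean speed
        [speed.std()],       # speed dispersion
    ])

# Usage on a synthetic straight-line trajectory with constant velocity (3, 4).
Z = np.outer(np.arange(10.0), [3.0, 4.0])
features = kinematic_features(Z, frac=0.5)
```

With $k$ components this layout yields $5k+2$ numbers per trajectory; whether the paper aggregates the vector-valued features differently is not specified in this section.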

#### Evaluation Protocol\.

For correctness prediction, we use a stratified question-level 80/20 train/test split, repeated over 5 random seeds, ensuring that all trajectories from a given question appear entirely in train or entirely in test. We report AUC-ROC (Area Under the Receiver Operating Characteristic curve) and AUPRC (Area Under the Precision–Recall Curve) with $\pm$ std across splits. For classifiers we use logistic regression (LR; $\ell_{2}$ regularization $C=0.1$), a two-layer MLP (hidden sizes 64 and 32, early stopping, $\ell_{2}$ regularization $\alpha=0.01$), and a two-layer GRU (hidden dim 64). We highlight AUPRC because our central question is whether trajectory geometry can rank correct solutions above incorrect ones. AUPRC summarizes this ranking through precision (how many of the trajectories flagged as correct actually are) and recall (how many of the truly correct trajectories are recovered) across all thresholds. This directly reflects our intended use, deciding which of the many candidate trajectories for a given question to continue, since a high-precision, high-recall score lets us concentrate on promising trajectories and potentially reach a correct solution faster.
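The question-level constraint is the key point of this protocol: a split must assign whole questions, not individual trajectories, to train or test. A minimal sketch of such a grouped split follows. The function name and arguments are hypothetical, and for brevity the sketch omits the stratification by label that the paper's protocol applies.

```python
import random
from collections import defaultdict

def question_level_split(question_ids, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) so that no question appears on both sides.

    question_ids[i] is the question that trajectory i was generated for.
    Stratification is intentionally left out of this sketch.
    """
    by_question = defaultdict(list)
    for i, q in enumerate(question_ids):
        by_question[q].append(i)
    questions = sorted(by_question)               # deterministic order before shuffling
    random.Random(seed).shuffle(questions)
    n_test = max(1, round(test_frac * len(questions)))
    test_questions = set(questions[:n_test])
    train_idx = [i for i, q in enumerate(question_ids) if q not in test_questions]
    test_idx = [i for i, q in enumerate(question_ids) if q in test_questions]
    return train_idx, test_idx

# Usage: 10 questions with 3 trajectories each, repeated over 5 seeds.
ids = [q for q in range(10) for _ in range(3)]
splits = [question_level_split(ids, seed=s) for s in range(5)]
```

Repeating the call with different seeds mirrors the five-seed averaging described above.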

### 5\.2Experiment 1: Correctness Prediction from Trajectory Geometry

We ask whether the geometry of a reasoning trajectory predicts whether it will reach a correct answer, and how early in generation this signal emerges\.

#### Cross\-question generalization\.

Figure [2](https://arxiv.org/html/2607.01571#S5.F2) shows AUPRC and AUC-ROC results for correctness prediction as a function of the fraction of the trajectory observed, for the LR, MLP, and GRU classifiers, under a stratified question-level 80/20 split averaged over 5 seeds. LR achieves AUC $=0.806\pm 0.132$ at $20\%$ observation and $0.839\pm 0.111$ at $100\%$, with AUPRC remaining stable around $0.868$ throughout. MLP reaches AUC $=0.723\pm 0.154$ at $20\%$, dips at $30\%$, then recovers to $0.828\pm 0.093$ at $100\%$. The GRU behaves qualitatively differently: it achieves AUC $=0.641\pm 0.215$ at $20\%$, peaks at $0.754\pm 0.078$ at $70\%$, and then *drops* to $0.577\pm 0.144$ at $100\%$, barely above chance.

![Refer to caption](https://arxiv.org/html/2607.01571v1/x2.png)
![Refer to caption](https://arxiv.org/html/2607.01571v1/x3.png)

Figure 2: Correctness is detectable from early trajectory geometry. (Left) AUPRC as a function of the fraction of the trajectory observed. (Right) AUC-ROC as a function of the fraction of the trajectory observed. Kinematic and positional features extracted from a windowed prefix of the reasoning trajectory predict whether the trajectory will reach a correct answer before generation is complete. The train/test split is at the question level, using an 80/20 split across five seeds.

#### Within\-question prediction\.

To understand the ceiling of the correctness signal, we also evaluate a setting where trajectories from the same question can appear in both train and test, while no individual trajectory appears in both (i.e., there is no trajectory-level leakage, only question-level overlap), using 5-fold cross-validation over the set of all pooled trajectories. Here LR achieves AUC $\approx 0.90$ and MLP achieves AUC $\approx 0.91$ using only $20\%$ of the trajectory, with negligible improvement as more tokens are observed (Table [1](https://arxiv.org/html/2607.01571#S5.T1)). This indicates that the geometric signal for correctness saturates very early. The gap between within-question ($0.90$) and cross-question ($0.81$) performance either reflects the portion of the signal that is question-specific rather than universally transferable, or indicates that our sample size was too small for the classifiers to generalize across questions.

Table 1: Within-question correctness prediction (5-fold CV). AUC-ROC $\pm$ std.

| Trajectory % | LR AUC | LR Std | MLP AUC | MLP Std |
|---|---|---|---|---|
| 20% | 0.902 | 0.012 | 0.912 | 0.023 |
| 50% | 0.902 | 0.016 | 0.914 | 0.015 |
| 100% | 0.919 | 0.017 | 0.909 | 0.022 |

### 5\.3Experiment 2: Difficulty Prediction via Effective Dimension

We ask whether the effective dimension $d_{\rho}$ of the hidden-state trajectory can predict whether a problem is easy or hard (MATH500 difficulty of $1$ vs. $5$), without any access to the answer or the model's output. For each problem, at layer $\ell=12$, we compute $d_{\rho}(H^{(\ell)},\rho)$ at $\rho\in\{0.90,0.95,0.99\}$, yielding a three-dimensional feature vector per trajectory. We then run leave-one-question-out cross-validation: for each test question, we train an LR or MLP classifier on the effective dimension features from all other questions, and predict whether the test question is easy or hard.
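The feature construction and the cross-validation loop described above can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, and the sketch assumes that $d_\rho$ is computed from the spectrum of the sample covariance of the centered hidden states (obtained here from the squared singular values of the centered trajectory matrix), which may differ in detail from the definition used in the paper. The classifier itself is left out.

```python
import numpy as np

RHOS = (0.90, 0.95, 0.99)

def dim_features(H, rhos=RHOS, tol=1e-12):
    """Three effective dimensions of a trajectory matrix H with shape (T, d).

    Assumption: the spectrum is that of the centered sample covariance,
    which is proportional to the squared singular values of the centered H.
    """
    Hc = H - H.mean(axis=0, keepdims=True)
    nu = np.linalg.svd(Hc, compute_uv=False) ** 2   # decreasing order
    F = np.cumsum(nu) / nu.sum()                    # cumulative mass F(j)
    return [int(np.argmax(F >= rho - tol)) + 1 for rho in rhos]

def leave_one_question_out(question_ids):
    """Yield (held_out_question, train_idx, test_idx), one fold per question."""
    for q in sorted(set(question_ids)):
        test_idx = [i for i, g in enumerate(question_ids) if g == q]
        train_idx = [i for i, g in enumerate(question_ids) if g != q]
        yield q, train_idx, test_idx

# Usage on synthetic data: a trajectory that moves along a single direction
# has effective dimension 1 at every threshold.
line = np.outer(np.arange(50.0), np.ones(8))
line_features = dim_features(line)
folds = list(leave_one_question_out(["a", "a", "b", "c", "c"]))
```

Because $F$ is nondecreasing, the three features are automatically nondecreasing in $\rho$.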

#### Results\.

Figure [3](https://arxiv.org/html/2607.01571#S5.F3) (left) shows AUC-ROC as a function of layer for both classifiers. Effective dimension is predictive at every layer, with AUC rising from $0.81$ at layer 0 to $0.93$ at layer 21. Prediction is consistent and robust: even the earliest layers achieve AUC well above chance, and the signal strengthens through the network. The best layer is layer 21 (MLP AUC $=0.93$, accuracy $=86.8\%$). Figure [3](https://arxiv.org/html/2607.01571#S5.F3) (right) shows the distribution of $d_{\rho}$ at layer 21 ($\rho=0.95$) for easy and hard problems. The separation is apparent: hard problems have mean effective dimension $\mu\approx 170$ versus $\mu=121.6$ for easy problems, a $40\%$ gap, with a t-test $p$-value of $5.79\times 10^{-166}$. Hard trajectories explore a substantially higher-dimensional subspace of the model's representation space, providing strong evidence for the relationship between effective dimension and task hardness.
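As a quick check of the quoted gap, using the rounded means reported in this section (so the result is itself approximate), the relative increase of the hard-problem mean over the easy-problem mean is

$$\frac{170-121.6}{121.6}=\frac{48.4}{121.6}\approx 0.398\approx 40\%.$$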

![Refer to caption](https://arxiv.org/html/2607.01571v1/x4.png)
![Refer to caption](https://arxiv.org/html/2607.01571v1/x5.png)

Figure 3: Effective dimension predicts task difficulty. (Left) AUC for easy vs. hard difficulty prediction across all 24 layers. Effective dimension achieves AUC $>0.9$ from layer 8 onward, peaking at $0.93$ at layer 21. (Right) Distribution of $d_{\rho}$ ($\rho=0.95$) at layer 21. Hard problems exhibit $40\%$ higher effective dimension than easy ones ($p=5.79\times 10^{-166}$). The spectral geometry of the hidden-state trajectory encodes task hardness across all layers, with the signal strengthening toward the final layers.

## 6Conclusion

We study the geometry of chain-of-thought reasoning trajectories in transformer hidden state space. Formalizing each reasoning chain as a discrete curve in $\mathbb{R}^{d}$, we introduce the effective dimension $d_{\rho}$ as a spectral measure of trajectory complexity and show theoretically that harder tasks necessarily induce higher-dimensional trajectories. Empirically, $d_{\rho}$ predicts problem difficulty with an AUC of $0.93$ under leave-one-question-out cross-validation, with hard problems exhibiting $40\%$ higher effective dimension on average than easy ones. For correctness prediction, seven kinematic features of the trajectory achieve AUC $=0.806$ from only the first $20\%$ of generated tokens under a question-level split, with a simple logistic regression outperforming a GRU on the full sequence, suggesting that much of the signal lies in coarse geometric structure. Together, these results establish trajectory geometry as a practical window into both task hardness and solution quality.

Limitations. Our experiments use a single small model and three MATH500 categories, so generalization to larger models and other domains remains open. Our theoretical analysis connects spectral flatness to effective dimension but does not fully explain how task difficulty induces this flatness in a trained transformer. Finally, our correctness prediction is evaluated in terms of AUC; translating this into concrete early-stopping or best-of-$n$ gains is left for future work.

## Acknowledgments

AJ was supported in part by the NSF Award DMS-2311024, an Amazon Faculty Research Award, an Adobe Faculty Research Award, and an iORB grant from USC Marshall School of Business.

## References

- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2607.01571#S1.p1.1), [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: [§5.1](https://arxiv.org/html/2607.01571#S5.SS1.SSS0.Px2.p1.5).
- A. Javanmard, B. Mirzasoleiman, and V. Mirrokni (2025) Understanding the role of training data in test-time scaling. arXiv preprint arXiv:2510.03605. Cited by: [§1](https://arxiv.org/html/2607.01571#S1.p1.1), [§2](https://arxiv.org/html/2607.01571#S2.p2.1), [§4.1](https://arxiv.org/html/2607.01571#S4.SS1.p2.3).
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, Vol. 35. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- T. Korbak, M. Balesni, E. Barnes, Y. Bengio, J. Benton, J. Bloom, M. Chen, A. Cooney, A. Dafoe, A. Dragan, et al. (2025) Chain of thought monitorability: a new and fragile opportunity for ai safety. arXiv preprint arXiv:2507.11473. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p3.1).
- A. W. Marshall, I. Olkin, and B. C. Arnold (2011) Inequalities: theory of majorization and its applications. 2 edition, Springer Series in Statistics, Springer New York. External Links: [Document](https://dx.doi.org/10.1007/978-0-387-68276-1), ISBN 978-0-387-68276-1. Cited by: [Definition 4](https://arxiv.org/html/2607.01571#Thmdefinition4.p1.8.4).
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto (2025) S1: simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- OpenAI (2024) External Links: [Link](https://openai.com/index/learning-to-reason-with-llms/). Cited by: [§1](https://arxiv.org/html/2607.01571#S1.p1.1), [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- A. Prasad, M. Joshi, K. Lee, M. Bansal, and P. Shaw (2026) Effective reasoning chains reduce intrinsic dimensionality. arXiv preprint arXiv:2602.09276. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p6.1).
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- J. Su, J. Healey, P. Nakov, and C. Cardie (2025) Between underthinking and overthinking: an empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127. Cited by: [§1](https://arxiv.org/html/2607.01571#S1.p1.1), [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- L. Sun, H. Dong, B. Qiao, Q. Lin, D. Zhang, and S. Rajmohan (2026) LLM reasoning as trajectories: step-specific representation geometry and correctness signals. arXiv preprint arXiv:2604.05655. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p4.3).
- A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023) Steering language models with activation engineering. arXiv preprint arXiv:2308.10248. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p4.3).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35. Cited by: [§1](https://arxiv.org/html/2607.01571#S1.p1.1), [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui (2024) From decoding to meta-generation: inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p1.1).
- A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, et al. (2024) Qwen2 technical report. arXiv preprint arXiv:2407.10671. Cited by: [§5.1](https://arxiv.org/html/2607.01571#S5.SS1.SSS0.Px1.p1.1).
- Y. Zhou, Y. Wang, X. Yin, S. Zhou, and A. R. Zhang (2025) The geometry of reasoning: flowing logics in representation space. arXiv preprint arXiv:2510.09782. Cited by: [§2](https://arxiv.org/html/2607.01571#S2.p5.1).

## Appendix AProofs

In the following proofs, for notational simplicity we write $T:=\mathrm{tr}(C)$.

### A\.1Proof of Proposition[1](https://arxiv.org/html/2607.01571#Thmtheorem1)

###### Proof\.

Let $r:=d_{\rho}(C)$. By definition of $d_{\rho}$, $r$ is the smallest integer with $\sum_{j=1}^{r}\nu_{j}\geq\rho T$, so we have

$$\sum_{j=1}^{r}\nu_{j}\;\geq\;\rho T\quad\text{and}\quad\sum_{j=1}^{r-1}\nu_{j}\;<\;\rho T.\tag{17}$$

Lower bound. Since the eigenvalues are sorted in decreasing order, $\nu_{j}\leq\nu_{1}$ for all $j$. Summing over $j=1,\ldots,r$:

$$\rho T\;\leq\;\sum_{j=1}^{r}\nu_{j}\;\leq\;r\cdot\nu_{1}.$$

Dividing by $\nu_{1}>0$ gives $r\geq\rho T/\nu_{1}$. Since $r$ is a positive integer, $r\geq\lceil\rho T/\nu_{1}\rceil$.

Upper bound. Since $\nu_{j}\geq\nu_{d}$ for all $j\leq r-1$, summing:

$$(r-1)\cdot\nu_{d}\;\leq\;\sum_{j=1}^{r-1}\nu_{j}\;<\;\rho T,$$

where the strict inequality is from ([17](https://arxiv.org/html/2607.01571#A1.E17)). Hence $r-1<\rho T/\nu_{d}$, and integrality gives $r\leq\lceil\rho T/\nu_{d}\rceil$. ∎
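The two bounds can be illustrated on a small spectrum. The following is a sketch under stated assumptions: the function names and example spectra are made up for illustration, and a small tolerance is added to guard against floating-point rounding at the threshold.

```python
import math

def d_rho(nu, rho, tol=1e-12):
    """Smallest r such that the r largest entries of nu sum to at least rho * T."""
    nu = sorted(nu, reverse=True)
    T = sum(nu)
    partial = 0.0
    for r, value in enumerate(nu, start=1):
        partial += value
        if partial >= rho * T - tol:
            return r
    return len(nu)

def prop1_bounds(nu, rho):
    """Lower and upper bounds ceil(rho*T/nu_1) and ceil(rho*T/nu_d); needs min(nu) > 0."""
    T = sum(nu)
    return math.ceil(rho * T / max(nu)), math.ceil(rho * T / min(nu))

# Illustrative spectra: a decaying one and a flat one.
decaying = [4.0, 3.0, 2.0, 1.0]
flat = [1.0, 1.0, 1.0, 1.0]
for spectrum in (decaying, flat):
    lower, upper = prop1_bounds(spectrum, 0.85)
    print(spectrum, lower, d_rho(spectrum, 0.85), upper)
```

For the flat spectrum the lower and upper bounds coincide, so the proposition pins down $d_\rho$ exactly in that case; for the decaying spectrum the upper bound exceeds $d$ and is therefore uninformative.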

### A\.2Proof of Proposition[2](https://arxiv.org/html/2607.01571#Thmtheorem2)

###### Proof\.

(1) $\nu^{\star}\prec\nu$. Both vectors have the same total $T$, so the equality condition in Def. [4](https://arxiv.org/html/2607.01571#Thmdefinition4) holds. We must show $\sum_{i=1}^{k}\nu^{\star}_{i}\leq\sum_{i=1}^{k}\nu_{i}$ for every $k$, i.e., $kT/d\leq\sum_{i=1}^{k}\nu_{i}$. Equivalently, we must show the top-$k$ average of $\nu$ is at least the overall average:

$$\frac{1}{k}\sum_{i=1}^{k}\nu_{i}\;\geq\;\frac{T}{d}.$$

Suppose for contradiction that $\frac{1}{k}\sum_{i=1}^{k}\nu_{i}<T/d$. Since $\nu$ is sorted decreasingly, $\nu_{k}\leq\frac{1}{k}\sum_{i=1}^{k}\nu_{i}<T/d$ and $\nu_{j}\leq\nu_{k}<T/d$ for all $j\geq k$. Then

$$T\;=\;\underbrace{\sum_{i=1}^{k}\nu_{i}}_{<kT/d}\;+\;\underbrace{\sum_{i=k+1}^{d}\nu_{i}}_{<(d-k)T/d}\;<\;\frac{kT}{d}+\frac{(d-k)T}{d}\;=\;T,$$

a contradiction. Hence the top-$k$ average is at least $T/d$, and $\nu^{\star}\prec\nu$.

(2) $d_{\rho}$ is maximized at the flat spectrum. Let $r^{\star}:=d_{\rho}(C^{\star})=\lceil\rho d\rceil$ (directly, since $\sum_{i=1}^{r^{\star}}\nu^{\star}_{i}=r^{\star}T/d\geq\rho T$ iff $r^{\star}\geq\rho d$). Applying part (1) at $k=r^{\star}$:

$$\sum_{i=1}^{r^{\star}}\nu_{i}\;\geq\;\sum_{i=1}^{r^{\star}}\nu^{\star}_{i}\;\geq\;\rho T.$$

By definition of $d_{\rho}$ as the *smallest* such $r$, $d_{\rho}(C)\leq r^{\star}=d_{\rho}(C^{\star})$.

Equality condition. Equality throughout requires $\nu^{\star}\prec\nu$ to be an equality of partial sums at every $k$, forcing $\nu_{i}=T/d$ for all $i$, i.e., $\nu=\nu^{\star}$. ∎
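Part (2) lends itself to a quick numerical sanity check: over randomly drawn spectra, $d_\rho$ should never exceed its value $\lceil\rho d\rceil$ at the flat spectrum. The sketch below is illustrative only; the function names, the dimension, the thresholds, and the uniform sampling scheme are assumptions of this example. Note that $d_\rho$ depends only on the ratios $\nu_i/T$, so the common-trace condition of the proposition can be met by rescaling without changing the result.

```python
import math
import random

def d_rho(nu, rho, tol=1e-12):
    """Smallest r such that the r largest entries of nu sum to at least rho * T."""
    nu = sorted(nu, reverse=True)
    T = sum(nu)
    partial = 0.0
    for r, value in enumerate(nu, start=1):
        partial += value
        if partial >= rho * T - tol:
            return r
    return len(nu)

def max_d_rho_over_random_spectra(d, rho, trials=1000, seed=0):
    """Largest d_rho observed over `trials` spectra with i.i.d. uniform(0, 1) entries."""
    rng = random.Random(seed)
    return max(d_rho([rng.random() for _ in range(d)], rho) for _ in range(trials))

d = 15
for rho in (0.5, 0.9):
    flat_value = d_rho([1.0] * d, rho)                 # equals ceil(rho * d)
    observed_max = max_d_rho_over_random_spectra(d, rho)
    print(rho, observed_max, flat_value)
```

A random search of this kind cannot prove the proposition, but a violation would immediately expose an error in the statement or in the implementation of $d_\rho$.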

### A\.3Proof of Theorem[3](https://arxiv.org/html/2607.01571#Thmtheorem3)

###### Proof\.

We first recall Weyl's inequality, which states that for any symmetric matrices $A,B\in\mathbb{R}^{d\times d}$ with eigenvalues sorted in decreasing order,

$$\bigl|\lambda_{k}(A+B)-\lambda_{k}(A)\bigr|\;\leq\;\|B\|_{\mathrm{op}}\qquad\text{for all }k=1,\ldots,d,$$

where $\|B\|_{\mathrm{op}}:=\max(|\lambda_{1}(B)|,|\lambda_{d}(B)|)$. Let $\nu_{1}\geq\cdots\geq\nu_{d}$ and $\hat{\nu}_{1}\geq\cdots\geq\hat{\nu}_{d}$ be the decreasingly sorted eigenvalues of $C$ and $\widehat{C}$. Taking $A=C$ and $B=\widehat{C}-C$ in Weyl's inequality gives $|\hat{\nu}_{j}-\nu_{j}|\leq\|\widehat{C}-C\|_{\mathrm{op}}\leq\epsilon$ for every $j$. Summing: $|\mathrm{tr}(\widehat{C})-\mathrm{tr}(C)|=|\sum_{j}(\hat{\nu}_{j}-\nu_{j})|\leq d\epsilon$. Now, let $S_{j}:=\sum_{i=1}^{j}\nu_{i}$ and $\widehat{S}_{j}:=\sum_{i=1}^{j}\hat{\nu}_{i}$. By Weyl again, $|\widehat{S}_{j}-S_{j}|\leq j\epsilon\leq d\epsilon$.

Define $F(j):=S_{j}/\mathrm{tr}(C)$ and $\widehat{F}(j):=\widehat{S}_{j}/\mathrm{tr}(\widehat{C})$. We compute

$$\bigl|\widehat{F}(j)-F(j)\bigr|\;=\;\left|\frac{\widehat{S}_{j}}{\mathrm{tr}(\widehat{C})}-\frac{S_{j}}{\mathrm{tr}(C)}\right|\;=\;\left|\frac{\widehat{S}_{j}\mathrm{tr}(C)-S_{j}\mathrm{tr}(\widehat{C})}{\mathrm{tr}(\widehat{C})\mathrm{tr}(C)}\right|.$$

Using $\widehat{S}_{j}\mathrm{tr}(C)-S_{j}\mathrm{tr}(\widehat{C})=(\widehat{S}_{j}-S_{j})\mathrm{tr}(C)+S_{j}(\mathrm{tr}(C)-\mathrm{tr}(\widehat{C}))$ and the triangle inequality:

$$\bigl|\widehat{F}(j)-F(j)\bigr|\;\leq\;\frac{|\widehat{S}_{j}-S_{j}|}{\mathrm{tr}(\widehat{C})}+\frac{S_{j}\cdot|\mathrm{tr}(C)-\mathrm{tr}(\widehat{C})|}{\mathrm{tr}(\widehat{C})\mathrm{tr}(C)}\;\leq\;\frac{d\epsilon}{\mathrm{tr}(\widehat{C})}+\frac{d\epsilon\cdot S_{j}}{\mathrm{tr}(\widehat{C})\mathrm{tr}(C)}.$$

Since $S_{j}\leq\mathrm{tr}(C)$, the second term is at most $d\epsilon/\mathrm{tr}(\widehat{C})$. Also $\mathrm{tr}(\widehat{C})\geq\mathrm{tr}(C)-d\epsilon\geq\mathrm{tr}(C)/2$ (using the assumption $d\epsilon\leq\mathrm{tr}(C)/2$). Therefore

$$\bigl|\widehat{F}(j)-F(j)\bigr|\;\leq\;\frac{2d\epsilon}{\mathrm{tr}(\widehat{C})}\;\leq\;\frac{4d\epsilon}{\mathrm{tr}(C)}.$$

By definition, $d_{\rho}(C)=\min\{j:F(j)\geq\rho\}$ and similarly for $\widehat{C}$. If $F$ and $\widehat{F}$ differ by at most $\delta:=4d\epsilon/\mathrm{tr}(C)$ at every index, then the two threshold crossings can differ only at indices where $F$ is within $\delta$ of $\rho$. (Formally: if $|F(j)-\rho|>\delta$, then either $F(j)>\rho+\delta\Rightarrow\widehat{F}(j)>\rho$, or $F(j)<\rho-\delta\Rightarrow\widehat{F}(j)<\rho$; either way $F$ and $\widehat{F}$ agree on whether the threshold has been reached at index $j$.) So the difference $|d_{\rho}(\widehat{C})-d_{\rho}(C)|$ is bounded by the number of indices $j$ with $|F(j)-\rho|\leq\delta$.

∎

## Appendix BExperimental Details

#### Hardware\.

Trajectory collection and analysis were run on NVIDIA A100 GPUs \(40 GB HBM2\) via a SLURM cluster\. Collection jobs used 16 GB of CPU RAM per job; effective dimension analysis used 24 GB; correctness prediction ran on CPU\-only nodes with 32 GB of RAM\.

#### Hyperparameters\.

- Temperature: $T=0.7$
- Maximum tokens: 800
- Runs per question: 10–15
- Minimum tokens for valid trajectory: 30
- $\rho$ values: $\{0.90,0.95,0.99\}$

#### Difficulty Prediction Classifier Details\.

- Logistic Regression: $C=1.0$, max iterations = 1000
- MLP: Hidden layers $(32,16)$, early stopping with 15% validation, $\ell_{2}$ regularization $\alpha=0.01$

#### Correctness Prediction Classifier Details\.

- Logistic Regression: $C=0.1$, max iterations = 1000, balanced class weights
- MLP: Hidden layers $(64,32)$, early stopping with 15% validation, $\ell_{2}$ regularization $\alpha=0.01$, balanced sample weights
- GRU: 2-layer, hidden dim 64, dropout 0.3, Adam optimizer (lr $=10^{-3}$, weight decay $=10^{-4}$), 30 epochs, batch size 32, BCE loss with positive class weighting, gradient clipping norm $1.0$

#### Prompts\.

We use three chain-of-thought prompting styles. The medium and long styles are pooled for all experiments reported in the main paper. The short style is included for completeness.

Short CoT Prompt: Go with your instinct\. Write only the essential steps \- no extra explanation\. Answer in \\boxed\{\}\.

Medium CoT Prompt: Think step by step\. Show your reasoning, then give the final answer in \\boxed\{\}\.

Long CoT Prompt: Overthink this\. Consider multiple approaches\. Solve step by step, then second\-guess yourself\. Check your work using a different method\. Ask: what could I be missing? What if I made an error? Keep thinking until you’re fully satisfied\. Answer in \\boxed\{\}\.
