When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

arXiv cs.LG Papers

Summary

This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.

arXiv:2605.22873v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a *dynamic decoding state* that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose **EDRM** (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves **41–55%** token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to **4.7%** while maintaining **27–45%** token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.

Cached at: 05/25/26, 08:55 AM

# When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
Source: [https://arxiv.org/html/2605.22873](https://arxiv.org/html/2605.22873)
Wei Xia \(2,1\), Haoqing Wang \(1\), Yehui Tang \(1\), and Zhi\-Hong Deng \(2, ✉\)

\(1\) Samsung Research, Beijing, China; \(2\) State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University

xwisawesome@stu\.pku\.edu\.cn, zhdeng@pku\.edu\.cn, \{haoqing\.wang, yehui\.tang\}@samsung\.com

✉ Corresponding Author

###### Abstract

Chain\-of\-thought \(CoT\) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open\-ended tasks while multiplying token consumption\. In this work, we show that LLM reasoning is not a static property of tasks or models, but a *dynamic decoding state* that emerges during generation\. Through systematic analysis, we find early\-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns\. This behavior can be interpreted as a phase\-transition\-like shift from a high\-entropy exploratory regime to a low\-entropy structured reasoning regime\. Based on these insights, we propose EDRM \(Entropy Dynamics\-based Reasoning Manifold\), a lightweight and training\-free routing framework that leverages early decoding entropy to adaptively select inference strategies\. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero\-shot deployment and fine\-grained instance\-level adaptation\. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines\. At the dataset level, EDRM achieves 41–55% token reduction while improving accuracy with as few as 50 calibration samples\. At the instance level, it further improves accuracy by up to 4\.7% while maintaining 27–45% token savings\. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy\-driven decoding control for efficient and adaptive LLM inference\.

## 1 Introduction

Chain\-of\-thought \(CoT\) reasoning has emerged as a powerful paradigm for eliciting complex reasoning capabilities from large language models \(LLMs\)\. Yet a growing body of evidence reveals a striking paradox: the very mechanism that unlocks multi\-step problem\-solving can degrade performance on tasks that demand direct recall or fluent generation Liu et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib24); [2025](https://arxiv.org/html/2605.22873#bib.bib19)\)\. This asymmetry poses a practical dilemma: CoT is increasingly deployed as a default strategy, despite its benefits being highly contingent on task type, model capability, and even individual query characteristics\. The resulting inefficiencies are substantial: inflated token costs, increased latency, and error propagation over extended generation horizons\. More fundamentally, this raises the question: *when should a model engage explicit reasoning, and how can this decision be made reliably at inference time?*

Understanding when reasoning helps requires answering two preliminary questions: whether reasoning is a static capability or a dynamic process Kim et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib35)\); Li et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib21)\); Zhao \([2026](https://arxiv.org/html/2605.22873#bib.bib37)\), and whether its utility is task\-intrinsic or jointly determined by model and problem complexity Sprague et al\. \([2024b](https://arxiv.org/html/2605.22873#bib.bib2)\); Ding et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib31)\)\. Prior works Sui et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib26)\); Su et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib25)\) have explored both dimensions, leading to routing and intervention strategies that adaptively invoke CoT\. However, they typically rely on offline profiling or heavy token\-level modifications, and lack a reliable instance\-level signal available at inference time\. In this work, we instead view reasoning as a *dynamic decoding state* that emerges from the interaction between a specific pair of model and query\. Crucially, this state is not directly observable from inputs, but unfolds during generation\.

This conceptual shift leads to a natural question: can early decoding dynamics provide a reliable signal for subsequent reasoning? We investigate this through the lens of token\-level entropy, which quantifies the uncertainty in the model’s next\-token distribution\. Our analysis reveals that early\-stage entropy dynamics provide a lightweight and reliable signal for this latent state\. We observe that tasks benefiting from CoT exhibit a consistent entropy reduction pattern, while low\-gain tasks show unstable or increasing entropy trajectories\. These distinct dynamics arise even under the same model, indicating that reasoning is not a fixed property but an emergent behavior conditioned on the model\-task pair\. Interestingly, this behavior exhibits characteristics analogous to a *phase transition*: the decoding process shifts from a high\-entropy exploratory regime to a low\-entropy structured reasoning regime when explicit reasoning becomes beneficial\. This perspective suggests that reasoning is not a binary capability, but a controllable transition in the model’s generation dynamics\.

Based on this observation, we propose EDRM \(Entropy Dynamics\-based Reasoning Manifold\), a training\-free routing framework that leverages early decoding entropy to enable instance\-adaptive inference\. EDRM embeds entropy trajectories into a compact manifold space, allowing the model to select appropriate reasoning strategies on the fly\. Across 15 benchmarks and 4 LLMs, EDRM reduces token usage by 27–55% while maintaining or improving accuracy\.

Contributions\. \(1\) We conceptualize LLM reasoning as a *dynamic decoding state*, and further interpret its emergence as a *phase\-transition\-like shift* in entropy dynamics, providing a principled view of when explicit reasoning becomes beneficial\. \(2\) We introduce EDRM, a simple yet effective framework for training\-free, instance\-level adaptive reasoning via entropy\-based routing\. \(3\) We demonstrate consistent improvements in both efficiency and accuracy across diverse models and benchmarks, highlighting the practical potential of entropy\-driven decoding control\.

## 2 Related Works

##### Static vs\. Dynamic Views of LLM Reasoning\.

Research on LLM reasoning evolves along two paradigms\. The static view treats reasoning as a fixed property of models or tasks: mechanistic interpretability identifies stable neural circuits \(e\.g\., reasoning heads / circuits\) as the underlying mechanism Kim et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib35)\); Conmy et al\. \([2023](https://arxiv.org/html/2605.22873#bib.bib38)\), and CoT is framed as a static capability unlocked via prompting Li et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib21)\); Wang and Zhou \([2024](https://arxiv.org/html/2605.22873#bib.bib28)\), disregarding generation dynamics\. In contrast, the dynamic view models reasoning as a conditional, emergent state during decoding: semantic entropy serves as a hallucination risk indicator Farquhar et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib20)\), and monotonic entropy decay signals reliable reasoning while oscillation predicts failure Zhao \([2026](https://arxiv.org/html/2605.22873#bib.bib37)\); Zhu et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib40)\), inspiring entropy\-aware decoding and adaptive injection Su et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib25)\); Jin et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib39)\); He et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib45)\)\. Recent work also quantifies reasoning effort via deep\-thinking tokens Chen et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib36)\) and distinguishes reasoning from recall via activation patterns Fartale et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib22)\)\.

##### Task\-Only vs\. Task\-Model Co\-Dependency\.

Early work adopts a task\-centric view, attributing reasoning utility solely to task type: CoT is beneficial for mathematical reasoning but limited or even harmful for factual retrieval Sprague et al\. \([2024b](https://arxiv.org/html/2605.22873#bib.bib2)\); Liu et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib24)\)\. In contrast, recent studies show that reasoning utility is jointly determined by task difficulty and model capability: stronger models can directly solve instances that require extensive CoT for weaker ones\. A range of methods instantiate this view by aligning problem complexity with compute via task–model calibration, including item response theory Fernandez et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib30)\), budget\-aware routing Ding et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib31)\), task\-aware adaptation Liu et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib41)\), and extensions to hybrid reasoning Jiang et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib33)\), mixture\-of\-experts Fein\-Ashley et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib34)\), and task decomposition Shao et al\. \([2025b](https://arxiv.org/html/2605.22873#bib.bib29); [a](https://arxiv.org/html/2605.22873#bib.bib43)\); Qi et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib44)\)\. However, these approaches rely on static offline profiling and cannot adapt decisions using real\-time decoding signals\. We address this limitation by incorporating instance\-level routing based on decoding dynamics\.

##### Adaptive Efficient Reasoning\.

Existing adaptive methods face three core limitations\. \(1\) High training overhead: approaches requiring path guides Sui et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib26)\), co\-evolved routers Huang et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib1)\), or synthetic data Liu et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib41)\) incur substantial deployment costs and hinder cross\-model generalization\. \(2\) Complex token\-level intervention: per\-token speculative decoding Su et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib25)\), position\-specific trigger injection Jin et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib39)\), and real\-time entropy modulation He et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib45)\) introduce operational fragility and impede seamless integration\. \(3\) Static precomputation: methods like RADAR Fernandez et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib30)\) and BEST\-Route Ding et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib31)\) rely on offline difficulty profiling, failing to capture instance\-level dynamics\. While lightweight entropy\-driven approaches Sharma and Chopra \([2025](https://arxiv.org/html/2605.22873#bib.bib32)\); Zhao \([2026](https://arxiv.org/html/2605.22873#bib.bib37)\); Zhu et al\. \([2026](https://arxiv.org/html/2605.22873#bib.bib40)\) and hybrid expert systems Jiang et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib33)\); Fein\-Ashley et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib34)\) reduce overhead, they lack structured reasoning representations or dynamic model–task coupling\. Fixed CoT induces compounding errors Gan et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib42)\), and test\-time scaling theories Snell et al\. \([2024](https://arxiv.org/html/2605.22873#bib.bib27)\) remain architecturally ungrounded\. Token Signature Liu et al\. \([2025](https://arxiv.org/html/2605.22873#bib.bib19)\) is the most similar to our work, but it routes only between CoT and Direct\. We compare against it in the experiments, where EDRM wins by a large margin\. EDRM uniquely combines training\-free dynamic detection via entropy dynamics with a compact reasoning manifold, achieving 27–55% token reduction while improving accuracy on base models, without architectural modifications or extensive training\.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/cot_gain_main.png)
Figure 1: CoT Acc Gains\. Accuracy improvement of CoT over Direct across models and tasks\. Comprehensive results are in Appendix [A\.2](https://arxiv.org/html/2605.22873#A1.SS2)\.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/heatmap_base0504.png)
Figure 2: Unified Gain Heatmap\. Each cell shows the unified gain of CoT\-Direct for instances in a specific region of the $(V_{\text{sp}}/a_{\text{vnr}}, S_H)$ space\.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/entropy_trend_smoothed_final.png)
Figure 3: Entropy Trajectories: average token\-level entropy over the first $N$ tokens under Standard probing\. Tasks with high CoT gain show a decreasing trend, while low\-gain tasks show oscillation or increase\.

## 3 Methodology

In this section, we first present preliminary concepts, then describe the observations and insights from our exploratory investigation into LLM decoding dynamics and their relationship to reasoning utility\. Finally, we introduce EDRM, a novel framework that leverages early\-stage entropy dynamics to adaptively route inference strategies for efficient and effective reasoning\.

### 3\.1 Preliminaries

##### Decoding paradigms\.

We consider three basic decoding paradigms under identical task descriptions:

- Direct: the model is instructed to output the final answer directly without explicit reasoning steps\. We implement this by prompting the model to answer directly without explanation; for thinking\-oriented models, we additionally turn off the think mode to prevent over\-reasoning\. This paradigm is efficient but may fail on tasks requiring multi\-step decomposition\.
- Standard: the model receives only the query and the minimal prompting required to elicit its intrinsic reasoning behavior; for thinking\-oriented models, the think mode remains off\. This paradigm allows the model to dynamically determine its reasoning strategy based on the query, without forcing explicit CoT or suppressing reasoning entirely\. We use this paradigm for subsequent probing and manifold construction, as it best reflects the model’s natural decoding dynamics without heavy intervention\.
- CoT: the model is instructed to carry out explicit step\-by\-step reasoning with CoT prompts and with think mode on \(if available\)\. This paradigm encourages the model to decompose complex problems into intermediate steps and serves as the approach with the heaviest reasoning intensity\. While potentially improving accuracy on complex tasks, it incurs substantial token overhead and may degrade performance on some tasks\.

These paradigms represent a spectrum from minimal intervention \(Direct\) to mandatory reasoning \(CoT\), with Standard occupying an intermediate position that allows the model’s intrinsic behavior to manifest\. For more details about the prompting templates and settings, please refer to Appendix[B\.2](https://arxiv.org/html/2605.22873#A2.SS2)\.

##### Autoregressive generation and token\-level entropy\.

Consider an autoregressive LLM that generates tokens sequentially\. At each decoding step $i$, the model produces a probability distribution $p_i$ over the vocabulary $\mathcal{V}$ conditioned on the input context and previously generated tokens\. The *token\-level entropy* at step $i$ is defined as:

$$H_i = -\sum_{v \in \mathcal{V}} p_i(v) \log p_i(v), \tag{1}$$

which quantifies the uncertainty in the model’s next\-token prediction\. Low entropy indicates that the model is confident about the next token \(convergent state\), while high entropy suggests uncertainty or exploration across multiple plausible continuations\. Throughout generation, the entropy trajectory $\{H_i\}_{i=1}^{N}$ captures the dynamics of how the model’s uncertainty evolves, providing a window into its reasoning process\.
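As a rough illustration of Eq. (1), the sketch below computes a per-step entropy trajectory from raw next-token logits. It is not the authors' implementation: the function names, the use of the natural logarithm, and the toy logits are all assumptions made here for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (max-shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(step_logits: Sequence[Sequence[float]]) -> List[float]:
    """Entropy H_i for each decoding step i, given that step's logits."""
    return [token_entropy(softmax(logits)) for logits in step_logits]


# Toy example (hypothetical logits over a 4-token vocabulary):
# step 1 is uniform (maximal uncertainty), step 2 is sharply peaked.
trajectory = entropy_trajectory([[0.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]])
```

In this toy case the first entry equals log 4 (a uniform distribution over four tokens) and the second is close to zero, matching the high-entropy versus low-entropy reading given above.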

##### Sequential statistics for entropy trajectory characterization\.

To systematically characterize entropy dynamics, we introduce two generic statistics for a scalar sequence $\{x_i\}_{i=1}^{N}$:

Spearman correlation\. Originally introduced as a non\-parametric metric, the Spearman correlation quantifies the monotonic relationship between two variables by evaluating their rank orders rather than raw values\. In our framework, it serves to robustly capture the global directional trend of entropy trajectories, remaining invariant to local non\-linear fluctuations or extreme outliers during LLM decoding\. Formally, for a sequence $\{x_i\}_{i=1}^{N}$, the Spearman correlation with respect to the step index is computed as:

$$\text{Spearman}(\{1,\dots,N\},\{x_1,\dots,x_N\}) = \mathrm{corr}\big(\mathrm{rank}(\{1,\dots,N\}),\, \mathrm{rank}(\{x_1,\dots,x_N\})\big) \tag{2}$$

When applied to entropy trajectories, a negative Spearman correlation indicates progressive uncertainty reduction \(a signal of convergent reasoning\), while positive or near\-zero values suggest unstable or exploratory dynamics\.
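A minimal sketch of Eq. (2) in plain Python follows. The helper names are hypothetical, ties are given average ranks (a common convention that the text does not specify), and returning 0.0 for a constant sequence is an assumption made here so the function is always defined.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; 0.0 if either input has zero variance (assumption)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def spearman_vs_step(x: Sequence[float]) -> float:
    """Spearman correlation between the step index 1..N and the sequence x."""
    steps = list(range(1, len(x) + 1))
    return pearson(average_ranks(steps), average_ranks(x))
```

A strictly decreasing trajectory gives a value of -1 and a strictly increasing one gives +1, regardless of how large the individual steps are.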

von Neumann ratio \(VNR\)\. Rooted in time\-series analysis, the von Neumann ratio traditionally evaluates serial dependency by contrasting successive differences against overall variance\. Here, it functions as a complementary metric to measure the local smoothness and volatility of reasoning steps, effectively distinguishing stable trajectories from highly oscillatory behavior\. Formally, for a sequence $\{x_i\}_{i=1}^{N}$, the VNR is defined as:

$$\text{VNR}(x) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}(x_{i+1}-x_i)^2}{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2+\epsilon} \tag{3}$$

where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\epsilon$ is a small constant for numerical stability\. For entropy trajectories, a small VNR indicates smooth, stable dynamics, while a large VNR suggests oscillatory behavior that may undermine reliable reasoning\. The VNR complements the Spearman correlation by capturing *how consistently* a trend manifests, rather than just its direction\.
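Eq. (3) translates almost line for line into code. The sketch below is illustrative only: the function name and the default `eps = 1e-8` are assumptions, since the text says only that the constant is small.

```python
from typing import Sequence


def von_neumann_ratio(x: Sequence[float], eps: float = 1e-8) -> float:
    """Mean squared successive difference divided by (variance + eps)."""
    n = len(x)
    if n < 2:
        raise ValueError("VNR needs at least two points")
    mean = sum(x) / n
    # Numerator: average squared step between consecutive values.
    numerator = sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / (n - 1)
    # Denominator: population variance, padded by eps for stability.
    denominator = sum((value - mean) ** 2 for value in x) / n + eps
    return numerator / denominator


smooth = von_neumann_ratio([1.0, 2.0, 3.0, 4.0])       # steady trend
oscillating = von_neumann_ratio([0.0, 1.0, 0.0, 1.0])  # zig-zag
```

For the steady ramp the ratio is about 0.8 (numerator 1, variance 1.25), while the zig-zag sequence gives about 4 (numerator 1, variance 0.25), which illustrates why a larger value is read as oscillation.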

### 3\.2 Systematic Analysis

We begin with two empirical observations \(Figure[1](https://arxiv.org/html/2605.22873#S2.F1)\)\. First, CoT reasoning is not universally beneficial: on a substantial subset of tasks, it yields marginal or even negative gains while significantly increasing token consumption\. Second, even on the same benchmark, different models can exhibit opposite CoT gains, suggesting that reasoning utility is not an intrinsic property of the task itself, but emerges from the interaction between the model and the task\.

Motivated by these observations, we study when reasoning becomes beneficial from a decoding\-dynamics perspective\. Our hypothesis is that successful reasoning corresponds to progressive uncertainty reduction during decoding, whereas unstable reasoning remains exploratory and prone to drift\. To examine this behavior, we analyze entropy trajectories over the first $N=64$ decoding steps under the Standard setting \(Figure [3](https://arxiv.org/html/2605.22873#S2.F3)\)\. We observe three consistent patterns: \(1\) tasks with strong CoT gains typically exhibit a clear downward entropy trend; \(2\) tasks with weak or negative CoT gains more often show oscillatory or increasing entropy; \(3\) the same model can display substantially different entropy dynamics across tasks, while different models may exhibit opposite trends on the same task\. These findings suggest that reasoning is fundamentally a dynamic property of the coupled $(M,Q)$ pair rather than a fixed capability\.

Interestingly, the observed behavior exhibits characteristics analogous to a *phase transition*: when explicit reasoning is beneficial, decoding gradually shifts from a high\-entropy exploratory regime toward a lower\-entropy structured reasoning regime\. In contrast, unstable trajectories often fail to enter such a convergent state, making aggressive reasoning inefficient or even harmful\. To further characterize these dynamics, we summarize each entropy trajectory using three complementary descriptors:

$$S_H = \sum_{i=1}^{N} H_i, \quad V_{\text{sp}} = \text{Spearman}(\{1,\dots,N\},\{H_1,\dots,H_N\}), \quad a_{\text{vnr}} = \text{VNR}(H_i) \tag{4}$$
Empirically, we observe several consistent geometric patterns in this low\-dimensional entropy space:

- Convergent reasoning regimes\. Tasks with strong CoT gains typically exhibit negative $V_{\text{sp}}$ and relatively small $a_{\text{vnr}}$, corresponding to stable entropy reduction during decoding\.
- Exploratory reasoning regimes\. Tasks with weak or negative CoT gains often show positive or oscillatory entropy trends, yielding larger $V_{\text{sp}}/a_{\text{vnr}}$ and indicating unstable reasoning dynamics\.
- Uncertainty\-overload regimes\. Excessively large cumulative entropy $S_H$ often indicates that decoding remains highly uncertain even in early stages, where CoT is unlikely to reliably converge\.

These regimes exhibit clear separability with respect to reasoning utility \(Figure[2](https://arxiv.org/html/2605.22873#S2.F2)\), indicating that early\-stage entropy dynamics naturally organize behaviors into structured low\-dimensional manifolds\.

Core Insights\. LLM reasoning emerges as distinct entropy\-dynamic regimes during decoding, where beneficial reasoning trajectories exhibit structured convergence and form separable manifolds for adaptive control\.

### 3\.3 Entropy Dynamics\-based Reasoning Manifold \(EDRM\)

Motivated by the above insights, we formulate adaptive reasoning as a *decoding\-state routing* problem\. Rather than treating reasoning as a fixed capability or invoking CoT indiscriminately, EDRM views each query as occupying a latent reasoning regime that emerges dynamically during decoding\. The key idea is that early\-stage entropy dynamics already contain sufficient information to infer whether explicit reasoning is likely to converge productively or drift into unstable exploration\.

##### Framework overview\.

Given a model $M$ and query $Q$, EDRM first performs a short probing decode under the Standard setting and records the next\-token distributions during the first $N$ decoding steps\. From the resulting entropy trajectory $\{H_i\}_{i=1}^{N}$, we construct a compact reasoning\-state representation $\mathbf{z}(Q) = (S_H,\; V_{\text{sp}},\; a_{\text{vnr}})$ computed as in Eq\. \([4](https://arxiv.org/html/2605.22873#S3.E4)\) using the generic statistics in Eq\. \([2](https://arxiv.org/html/2605.22873#S3.E2)\) and Eq\. \([3](https://arxiv.org/html/2605.22873#S3.E3)\)\. Importantly, we do not treat $\mathbf{z}(Q)$ as three arbitrary numbers\. Under our decoding\-dynamics view, the three coordinates correspond to three complementary aspects of the emerging reasoning regime: \(1\) uncertainty load \($S_H$\), \(2\) convergence direction \($V_{\text{sp}}$\), and \(3\) stability \($a_{\text{vnr}}$\)\. Intuitively, a negative $V_{\text{sp}}$ indicates progressive uncertainty reduction \(entering a structured reasoning phase\), while a large $a_{\text{vnr}}$ indicates oscillation that makes any apparent trend less reliable; a large $S_H$ reflects an early uncertainty\-overload regime where additional reasoning steps are unlikely to settle\. Under this interpretation, queries with similar entropy\-dynamic behaviors naturally cluster into nearby regions of this low\-dimensional space, forming the Entropy Dynamics\-based Reasoning Manifold\. Unlike conventional routing approaches that rely on external classifiers, offline profiling, or heavy token\-level interventions, EDRM derives routing signals directly from intrinsic decoding dynamics\. This makes the core pipeline lightweight, training\-free, and naturally compatible with different model families and prompting strategies\.

##### Adaptive reasoning routing\.

Based on the manifold representation, EDRM adaptively selects among Direct, Standard, and CoT decoding\. The routing policy follows directly from the empirical regimes summarized in the previous subsection:

$$\mathcal{M}(Q) = \begin{cases} \text{Direct}, & V_{\text{sp}} > k \cdot a_{\text{vnr}} \;\lor\; (V_{\text{sp}} > 0 \land S_H > S_{H,\text{th}}), \\ \text{CoT}, & V_{\text{sp}} < -k \cdot a_{\text{vnr}}, \\ \text{Standard}, & \text{otherwise}. \end{cases} \tag{5}$$

Intuitively, strongly negative $V_{\text{sp}}$ indicates stable entropy reduction and progressive convergence, where explicit reasoning is more likely to improve performance\. In contrast, positive or highly oscillatory entropy dynamics indicate unstable exploration and elevated drift risk, favoring conservative decoding strategies\. The additional threshold condition on $S_H$ captures uncertainty\-overload regimes, where excessive early\-stage uncertainty often prevents CoT from reliably converging\. After selecting $\mathcal{M}(Q)$, EDRM performs the actual answer generation using the chosen decoding paradigm\. Importantly, the router is not designed around task categories or manually curated heuristics\. Instead, routing decisions emerge directly from the model’s own decoding dynamics, enabling EDRM to generalize across heterogeneous tasks and model scales using a unified dynamical criterion\.
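The routing rule in Eq. (5) can be sketched as a small pure function. This is an illustration under stated assumptions, not the authors' code: the function and parameter names are hypothetical, and the value `k = 0.5` in the usage lines is made up for the example because this section does not report the value of k. The threshold 32 is the empirical value the paper gives for base models.

```python
def route(s_h: float, v_sp: float, a_vnr: float, k: float, s_h_th: float) -> str:
    """Pick a decoding mode from the three entropy descriptors."""
    # Direct: clearly rising entropy, or rising entropy with overload.
    if v_sp > k * a_vnr or (v_sp > 0 and s_h > s_h_th):
        return "Direct"
    # CoT: entropy falls strongly relative to its volatility.
    if v_sp < -k * a_vnr:
        return "CoT"
    # Otherwise leave the model to its default behaviour.
    return "Standard"


K = 0.5          # hypothetical value, for illustration only
S_H_TH = 32.0    # empirical threshold stated for base models

converging = route(s_h=20.0, v_sp=-0.8, a_vnr=0.5, k=K, s_h_th=S_H_TH)
diverging = route(s_h=20.0, v_sp=0.6, a_vnr=0.5, k=K, s_h_th=S_H_TH)
overloaded = route(s_h=40.0, v_sp=0.1, a_vnr=0.5, k=K, s_h_th=S_H_TH)
ambiguous = route(s_h=20.0, v_sp=0.1, a_vnr=0.5, k=K, s_h_th=S_H_TH)
```

With these toy inputs the four calls select CoT, Direct, Direct, and Standard respectively. Note that the Direct branch is tested first, so a query satisfying the overload condition never reaches the CoT test.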

##### Learned manifold router\.

The heuristic router above is the default, training\-free instantiation\. For finer\-grained adaptation under distribution shift, EDRM also supports an *optional* lightweight learned router\. Specifically, we train a three\-layer MLP on a small calibration set, where labels are constructed by evaluating all decoding modes and selecting the utility\-optimal strategy according to task\-specific objectives\. The learned router predicts $\mathcal{M}(Q) = \arg\max_{m} p_{\theta}(m \mid \mathbf{z}(Q))$ using either the compact manifold representation $\mathbf{z}(Q)$ or the full entropy trajectory as input\. Since the manifold already provides strong low\-dimensional structure, the learned router remains lightweight and requires only a small number of calibration samples\.

Algorithm 1: Entropy Dynamics\-based Routing \(EDRM\-Global and EDRM\-Heuristic\)

1. Input: $x$ \(sample or dataset\), probe length $N$, hyper\-parameters $(k, S_{H,\text{th}})$
2. Output: inference mode $M \in \{\text{Direct}, \text{Standard}, \text{CoT}\}$
3. if dataset\-level routing then
4. Compute statistics $(S_H, V_{\text{sp}}, a_{\text{vnr}}) \leftarrow (\bar{S}_H, \bar{V}_{\text{sp}}, \bar{a}_{\text{vnr}})$
5. else ▷ instance\-level routing
6. Run probing on $x$ $\rightarrow$ entropy $E_x$
7. Compute $S_H(x), V_{\text{sp}}(x), a_{\text{vnr}}(x)$
8. end if
9. if $V_{\text{sp}} > k \cdot a_{\text{vnr}}$ or $(V_{\text{sp}} > 0 \;\textbf{and}\; S_H > S_{H,\text{th}})$ then
10. $M \leftarrow \text{Direct}$
11. else if $V_{\text{sp}} < -k \cdot a_{\text{vnr}}$ then
12. $M \leftarrow \text{CoT}$
13. else
14. $M \leftarrow \text{Standard}$
15. end if
16. return $M$

##### Deployment scenarios\.

EDRM supports both dataset\-level and instance\-level deployment\. For low\-resource or cold\-start settings, EDRM estimates global manifold statistics from a small calibration subset to initialize routing hyper\-parameters without additional training\. This provides a lightweight initialization mechanism for new datasets or models\. For instance\-level deployment, each query independently undergoes probing, manifold embedding, and adaptive routing, enabling fine\-grained control over the trade\-off between reasoning quality and token efficiency during inference\. The working pipelines are presented in Algorithm[1](https://arxiv.org/html/2605.22873#alg1)\.

##### Calibrating $S_{H,\text{th}}$ \(practical step\)\.

In the heuristic router, $S_{H,\text{th}}$ flags instances whose early decoding exhibits excessive cumulative uncertainty\. We consider two ways to set $S_{H,\text{th}}$: \(1\) Empirical thresholds: we set $S_{H,\text{th}}=32$ for base models and $S_{H,\text{th}}=10$ for reasoning models, since these values are the bin\-search optimal\. \(2\) Cross\-dataset calibration: $S_{H,\text{th}}$ is adaptively set based on dataset\-level entropy trends\. Since datasets with average convergence \($V_{\text{sp}}<0$\) benefit from CoT while divergent ones \($V_{\text{sp}}\geq 0$\) risk drift, we set the threshold to the $M$\-th smallest dataset\-mean cumulative entropy: $S_{H,\text{th}} = \lfloor S_{(M)} \rfloor$, where $M = |\{j : V_{\text{sp}}^{(j)} < 0\}|$ and $S_{(M)}$ denotes the $M$\-th order statistic of $\{S_H^{(j)}\}$\. We provide the full definition, boundary handling, and pseudocode in Appendix [A\.4\.1](https://arxiv.org/html/2605.22873#A1.SS4.SSS1)\.
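The cross-dataset calibration rule can be sketched as follows. This is a hedged illustration only: the function name, the input layout, and the toy numbers are hypothetical, and the `fallback` behaviour when no dataset has negative mean trend is an assumption made here, since the boundary handling is deferred to the appendix and not shown in this section.

```python
import math
from typing import Sequence, Tuple


def calibrate_s_h_threshold(
    dataset_stats: Sequence[Tuple[float, float]], fallback: float
) -> float:
    """dataset_stats holds (mean S_H, mean V_sp) per calibration dataset."""
    # M = number of datasets whose mean entropy trend is negative.
    m = sum(1 for _, mean_v_sp in dataset_stats if mean_v_sp < 0)
    if m == 0:
        # Assumption: with no converging dataset, keep a preset threshold.
        return fallback
    # S_(M): the M-th smallest dataset-mean cumulative entropy.
    ordered = sorted(mean_s_h for mean_s_h, _ in dataset_stats)
    return math.floor(ordered[m - 1])


# Toy calibration data (made-up values): two datasets trend downward.
stats = [(12.4, -0.3), (45.0, 0.2), (28.9, -0.1), (33.7, 0.05)]
threshold = calibrate_s_h_threshold(stats, fallback=32.0)
```

Here M is 2, the sorted cumulative entropies are 12.4, 28.9, 33.7, and 45.0, so the second order statistic is 28.9 and the threshold is 28.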

##### Other details about EDRM

We provide comprehensive technical details and extended analyses in Appendix [A](https://arxiv.org/html/2605.22873#A1). Appendix [A.1](https://arxiv.org/html/2605.22873#A1.SS1) formally defines the three entropy descriptors and justifies design choices. Appendix [A.2](https://arxiv.org/html/2605.22873#A1.SS2) extends the CoT-vs-Direct comparison to all model–benchmark pairs, showing substantial CoT gains on math tasks but marginal or negative gains on commonsense tasks. Appendix [A.3](https://arxiv.org/html/2605.22873#A1.SS3) visualizes entropy trajectories across all settings. Appendix [A.4](https://arxiv.org/html/2605.22873#A1.SS4) introduces the Unified Gain metric and heatmaps that link CoT utility to manifold coordinates. Appendix [A.4.1](https://arxiv.org/html/2605.22873#A1.SS4.SSS1) details the cross-dataset threshold calibration procedure. Appendix [A.5](https://arxiv.org/html/2605.22873#A1.SS5) analyzes the three-branch routing design and decision boundaries. Appendix [A.6](https://arxiv.org/html/2605.22873#A1.SS6) describes the learned MLP router variants and training setup.

## 4 Experiments

### 4.1 Experimental Setup

We evaluate EDRM on 15 benchmarks spanning diverse reasoning types and difficulty levels, with 4 different LLMs to validate cross\-model generalization\. Implementation details are as follows\.

Datasets. We categorize the 15 benchmarks into 4 groups. (1) Mathematical reasoning: gsm8k Cobbe et al. ([2021](https://arxiv.org/html/2605.22873#bib.bib3)), MultiArith Roy and Roth ([2016](https://arxiv.org/html/2605.22873#bib.bib4)), and bbh Srivastava et al. ([2022](https://arxiv.org/html/2605.22873#bib.bib5)). (2) Commonsense & knowledge reasoning: commonsenseqa Speer et al. ([2016](https://arxiv.org/html/2605.22873#bib.bib6)), strategyqa Geva et al. ([2021](https://arxiv.org/html/2605.22873#bib.bib15)), piqa Bisk et al. ([2019](https://arxiv.org/html/2605.22873#bib.bib18)), siqa Sap et al. ([2019](https://arxiv.org/html/2605.22873#bib.bib17)), and MuSR Sprague et al. ([2024a](https://arxiv.org/html/2605.22873#bib.bib16)). (3) Scientific reasoning: arc\_challenge Clark et al. ([2018](https://arxiv.org/html/2605.22873#bib.bib7)), arc\_easy Clark et al. ([2018](https://arxiv.org/html/2605.22873#bib.bib7)), and gpqa Rein et al. ([2023](https://arxiv.org/html/2605.22873#bib.bib8)). (4) Formal logic: FOLIO Han et al. ([2024](https://arxiv.org/html/2605.22873#bib.bib9)), ContextHub\_abductive Hua et al. ([2025](https://arxiv.org/html/2605.22873#bib.bib10)), ContextHub\_deductive Hua et al. ([2025](https://arxiv.org/html/2605.22873#bib.bib10)), and lsat Zhong et al. ([2023](https://arxiv.org/html/2605.22873#bib.bib11)).

Models. We test 4 LLMs to validate generalization. Base models: (1) Llama-3.2-3B-Instruct Grattafiori et al. ([2024](https://arxiv.org/html/2605.22873#bib.bib12)), (2) Llama-3.1-8B-Instruct Grattafiori et al. ([2024](https://arxiv.org/html/2605.22873#bib.bib12)), and (3) Qwen2.5-7B-Instruct Hui et al. ([2024](https://arxiv.org/html/2605.22873#bib.bib14)), representing diverse scales (3B–8B) and families. Reasoning-enhanced model: (4) Qwen3-4B-Instruct-2507 Yang et al. ([2025](https://arxiv.org/html/2605.22873#bib.bib13)), trained explicitly for chain-of-thought generation with a built-in think mode. Unlike the base models, it is prone to over-reasoning, which makes it a demanding test of whether EDRM's adaptive routing remains robust under adversarial circumstances.

Baselines. We compare 9 decoding strategies across two categories. Static regimes apply a fixed decoding mode to all instances: (1) Direct (no reasoning), (2) Standard (minimal prompting), and (3) CoT (always-on chain-of-thought). Adaptive routing dynamically selects among regimes: (4) Token-Signature, a two-way routing baseline most similar to EDRM; and EDRM variants with two granularities: (5–6) EDRM-Global-E/C for dataset-level routing and (7–8) EDRM-Inst-E/C for instance-level routing, where "E" uses empirical thresholds ($S_{H,\text{th}}=32$ for base models, $10$ for reasoning models) and "C" uses cross-dataset calibration; (9) EDRM-MLP employs a learned instance-level router. All stochastic variants are evaluated over 8 random seeds with mean and variance reported.

Evaluation metrics. We report accuracy and average token consumption to capture the performance–efficiency trade-off. For multiple trials, we report mean and variance to indicate result stability. Additional metrics (e.g., consistency) are included where necessary. For the detailed calculation of accuracy and token consumption, please refer to Appendix [B.3](https://arxiv.org/html/2605.22873#A2.SS3).

Other details. We set the probe length to $N=64$ to balance feature richness and computation cost, and the heuristic router hyperparameter to $k=0.07$. All generation uses greedy decoding. When sampling is required, we sample 50 instances (1%–7% of the full set) and run 8 trials with different random seeds. The MLP results reported in the main text use the full entropy trajectory as input (64D) with the original labels; results for the 64D-input MLP with calibration labels, and for the 3D-input and 67D-input MLPs with original or calibration labels, are presented in Appendix [A.6](https://arxiv.org/html/2605.22873#A1.SS6). More details about the experiments and other results (e.g., hyperparameter sensitivity) are in Appendix [B](https://arxiv.org/html/2605.22873#A2).
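To make the probing step concrete, the sketch below computes a per-token Shannon entropy trajectory from the logits of the first $N$ decoding steps and reduces it to three numbers. Only the entropy formula and the use of a cumulative sum for $S_H$ follow directly from the text; the trend (least-squares slope) and volatility (standard deviation of the residuals) are stand-ins chosen for illustration, since the formal definitions of $V_{\text{sp}}$ and $a_{\text{vnr}}$ are given in Appendix A.1. Measuring entropy in nats is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np


def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (nats) of each next-token distribution.

    logits has shape (N, vocab_size): one row per probed decoding step.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def descriptors(h: np.ndarray) -> tuple[float, float, float]:
    """Illustrative (S_H, V_sp, a_vnr) from an entropy trajectory h."""
    steps = np.arange(len(h))
    s_h = float(h.sum())                        # cumulative entropy
    slope, intercept = np.polyfit(steps, h, 1)  # straight-line fit
    v_sp = float(slope)                         # stand-in trend
    a_vnr = float(np.std(h - (slope * steps + intercept)))  # stand-in volatility
    return s_h, v_sp, a_vnr
```

For a probe of 64 steps, `token_entropies` returns a length-64 vector, which matches the 64D input used by the MLP router in the main text.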

### 4.2 Main Results and Analysis

Table 1: Overall performance comparison across 4 LLMs. EDRM achieves consistent token reduction (27–55%) while improving accuracy on base models compared to CoT. Instance-level routing (EDRM-Inst-E/C/MLP) consistently outperforms global-level routing on base models, demonstrating the advantage of fine-grained adaptation. Accuracy values are weighted means across all benchmarks.

| Model | Direct Acc | Direct Tok | Standard Acc | Standard Tok | CoT Acc | CoT Tok | Token-Sign Acc | Token-Sign Tok | Global-E $\overline{\mathrm{Acc}}$ | Global-E $\overline{\mathrm{Tok}}$ | Global-C $\overline{\mathrm{Acc}}$ | Global-C $\overline{\mathrm{Tok}}$ | Inst-E Acc | Inst-E Tok | Inst-C $\overline{\mathrm{Acc}}$ | Inst-C $\overline{\mathrm{Tok}}$ | EDRM-MLP Acc | EDRM-MLP Tok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.2-3B | 52.75 | 4.4 | 59.92 | 194.4 | 60.84 | 251.7 | 59.63 | 184.9 | 61.89 | 113.2 | 61.89 | 113.2 | 65.24 | 179.4 | 64.70 | 186.9 | 66.64 | 178.9 |
| Llama3.1-8B | 59.00 | 5.0 | 68.21 | 127.1 | 68.07 | 277.8 | 65.71 | 170.8 | 68.48 | 164.4 | 68.46 | 167.4 | 71.97 | 153.9 | 71.01 | 170.16 | 72.27 | 149.9 |
| Qwen2.5-7B | 64.63 | 3.4 | 72.89 | 240.6 | 73.38 | 330.3 | 72.67 | 247.4 | 74.01 | 191.8 | 74.01 | 190.6 | 78.11 | 242.7 | 76.91 | 255.3 | 77.93 | 240.4 |
| Qwen3-4B-T | 65.63 | 6.2 | 80.96 | 518.4 | 81.35 | 642.5 | 77.65 | 400.6 | 78.23 | 335.1 | 78.23 | 335.1 | 80.29 | 401.1 | 79.03 | 467.5 | 81.20 | 424.8 |

Table 2: EDRM-Global (E & C) performance comparison of Llama-3.2-3B-Instruct across 15 benchmarks. Both EDRM variants achieve 61.89% accuracy with 113.2 average tokens, reducing token cost by 55.0% compared to CoT (251.7 tokens) while improving accuracy by 1.05 percentage points. All per-dataset accuracy variances are below $1.8\times 10^{-3}$.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | Global-E $\overline{\text{Acc}}$ (%) | Global-E $\overline{\text{Tok}}$ | Global-E D:S:C | Global-C $\overline{\text{Acc}}$ (%) | Global-C $\overline{\text{Tok}}$ | Global-C D:S:C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-C | 71.93 | 4.0 | 76.54 | 177.9 | 75.17 | 243.8 | 71.93 | 4.0 | 8:0:0 | 71.93 | 4.0 | 8:0:0 |
| ARC-E | 87.37 | 4.0 | 88.85 | 169.2 | 89.23 | 237.1 | 87.37 | 4.0 | 8:0:0 | 87.37 | 4.0 | 8:0:0 |
| BBH | 44.65 | 4.2 | 53.33 | 215.5 | 55.06 | 277.8 | 53.55 | 223.3 | 0:7:1 | 53.55 | 223.3 | 0:7:1 |
| CSQA | 65.60 | 5.4 | 67.90 | 88.3 | 65.44 | 186.1 | 65.60 | 5.4 | 8:0:0 | 65.60 | 5.4 | 8:0:0 |
| CH-Abd | 33.46 | 3.6 | 35.08 | 242.2 | 36.42 | 254.6 | 35.25 | 243.7 | 0:7:1 | 35.25 | 243.7 | 0:7:1 |
| CH-Ded | 20.67 | 3.9 | 42.83 | 280.1 | 41.54 | 258.1 | 42.02 | 266.4 | 0:3:5 | 42.02 | 266.4 | 0:3:5 |
| FOLIO | 43.27 | 3.8 | 44.77 | 298.2 | 48.59 | 334.4 | 48.59 | 334.4 | 0:0:8 | 48.59 | 334.4 | 0:0:8 |
| GPQA | 33.93 | 4.0 | 25.67 | 530.0 | 24.78 | 767.0 | 30.83 | 201.3 | 5:3:0 | 30.83 | 201.3 | 5:3:0 |
| GSM8K | 8.42 | 6.0 | 75.66 | 227.2 | 74.07 | 231.5 | 75.06 | 228.8 | 0:5:3 | 75.06 | 228.8 | 0:5:3 |
| LSAT | 41.13 | 4.0 | 41.13 | 303.3 | 39.25 | 420.5 | 41.13 | 4.0 | 8:0:0 | 41.13 | 4.0 | 8:0:0 |
| MultiArith | 24.50 | 5.6 | 96.83 | 129.6 | 97.67 | 115.7 | 97.46 | 119.2 | 0:2:6 | 97.46 | 119.2 | 0:2:6 |
| MuSR | 51.32 | 4.1 | 48.68 | 154.9 | 49.74 | 178.1 | 51.32 | 4.1 | 8:0:0 | 51.32 | 4.1 | 8:0:0 |
| PIQA | 75.52 | 4.8 | 71.49 | 130.4 | 72.52 | 191.6 | 75.52 | 4.8 | 8:0:0 | 75.52 | 4.8 | 8:0:0 |
| SIQA | 63.15 | 4.9 | 64.84 | 76.0 | 63.36 | 191.1 | 63.15 | 4.9 | 8:0:0 | 63.15 | 4.9 | 8:0:0 |
| StratQA | 81.40 | 4.3 | 61.44 | 121.9 | 70.52 | 225.9 | 81.40 | 4.3 | 8:0:0 | 81.40 | 4.3 | 8:0:0 |
| Overall | 52.75 | 4.4 | 59.92 | 194.4 | 60.84 | 251.7 | 61.89 | 113.2 | – | 61.89 | 113.2 | – |

#### 4.2.1 EDRM-Global Performance

Detailed results of EDRM-Global on Llama-3.2-3B-Instruct are presented in Table [2](https://arxiv.org/html/2605.22873#S4.T2), while the overall results across 4 LLMs are shown in Table [1](https://arxiv.org/html/2605.22873#S4.T1). Detailed results of EDRM-Global on other models are in Appendix [B.4](https://arxiv.org/html/2605.22873#A2.SS4). We summarize the key findings as follows.

##### Stability under sampling & threshold variance\.

EDRM-Global-E (empirical thresholds) and EDRM-Global-C (calibrated thresholds) produce identical routing decisions across 8 random seeds on Llama-3.2-3B-Instruct, with per-dataset accuracy variance below $1.8\times 10^{-3}$. The highly concentrated D:S:C counts (Direct:Standard:CoT selections over the 8 seeds; e.g., 8:0:0 for ARC-C, 0:0:8 for FOLIO) confirm that lightweight global probing yields reliable, low-variance dataset-level routing.
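The D:S:C column can be sanity-checked against the static baselines. The sketch below assumes that each seed contributes the accuracy and token count of whichever static mode it selected, so the reported mean is a count-weighted average; this reading is an inference from Table 2, not a procedure stated in the text, and `seed_mixture` is a hypothetical helper.

```python
def seed_mixture(counts, values):
    """Average a per-mode metric weighted by how many seeds chose each mode.

    counts = (Direct, Standard, CoT) seed counts; values = matching metrics.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one seed is required")
    return sum(c * v for c, v in zip(counts, values)) / total


# GPQA row of Table 2: routing 5:3:0, static accuracies and token counts.
gpqa_acc = seed_mixture((5, 3, 0), (33.93, 25.67, 24.78))
gpqa_tok = seed_mixture((5, 3, 0), (4.0, 530.0, 767.0))
print(f"{gpqa_acc:.2f} {gpqa_tok:.2f}")  # 30.83 201.25
```

Under this assumption the GPQA mixture agrees with the reported 30.83% accuracy and 201.3 tokens up to rounding, and the same check holds for the other mixed rows (BBH, CH-Abd, CH-Ded, GSM8K, MultiArith).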

##### Strong accuracy–efficiency trade\-off\.

EDRM-Global achieves 61.89% accuracy (+1.05 percentage points over CoT) while reducing token consumption by 55.0% (251.7 $\to$ 113.2) on Llama-3.2-3B-Instruct. It also surpasses the Standard baseline in both accuracy (59.92% $\to$ 61.89%) and cost (194.4 $\to$ 113.2), showing effective mitigation of over-reasoning without performance loss.
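The headline figures follow from simple ratios of the overall rows in Table 2. A short check, with `token_reduction` as a hypothetical helper name:

```python
def token_reduction(baseline_tokens: float, routed_tokens: float) -> float:
    """Percentage of tokens saved relative to a baseline decoding mode."""
    if baseline_tokens <= 0:
        raise ValueError("baseline token count must be positive")
    return 100.0 * (1.0 - routed_tokens / baseline_tokens)


# Llama-3.2-3B-Instruct, overall rows of Table 2.
saving_vs_cot = token_reduction(251.7, 113.2)       # EDRM-Global vs. CoT
saving_vs_standard = token_reduction(194.4, 113.2)  # EDRM-Global vs. Standard
accuracy_gain = 61.89 - 60.84                       # percentage points over CoT
```

Rounded to one decimal place, `saving_vs_cot` reproduces the reported 55.0% reduction.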

##### High alignment with reasoning demand\.

EDRM-Global consistently selects Direct for retrieval-heavy tasks (ARC-C/E, CSQA, PIQA), where CoT offers negligible gains at $10$–$100\times$ the token cost, and CoT for formal logic (FOLIO), where structured reasoning is essential. Mixed routing on multi-step benchmarks (GSM8K, BBH) reveals decision-boundary sensitivity, motivating instance-level refinement.

Table 3: EDRM instance-level performance on Llama-3.2-3B-Instruct. EDRM-MLP achieves 66.64% Acc with 178.9 tokens, outperforming both heuristic variants (EDRM-Inst-E/C). Of the two heuristic variants, EDRM-Inst-E is the more token-efficient overall (179.4 tokens). EDRM-Inst-C shows variance below $1.4\times 10^{-6}$ across 8 random seeds. Detailed per-dataset results for other models are in Appendix [B.5](https://arxiv.org/html/2605.22873#A2.SS5).

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | Token-Sign Acc (%) | Token-Sign Tok | Inst-E Acc (%) | Inst-E Tok | Inst-C $\overline{\mathrm{Acc}}$ (%) | Inst-C $\overline{\mathrm{Tok}}$ | EDRM-MLP Acc (%) | EDRM-MLP Tok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-C | 71.93 | 4.0 | 76.54 | 177.9 | 75.17 | 243.8 | 73.72 | 141.9 | 77.56 | 129.8 | 77.56 | 128.0 | 78.24 | 134.4 |
| ARC-E | 87.37 | 4.0 | 88.85 | 169.2 | 89.23 | 237.1 | 87.88 | 138.5 | 89.73 | 132.1 | 89.65 | 129.6 | 90.49 | 131.0 |
| BBH | 44.65 | 4.2 | 53.33 | 215.5 | 55.06 | 277.8 | 52.70 | 261.7 | 58.87 | 242.2 | 58.87 | 242.0 | 59.35 | 206.1 |
| CSQA | 65.60 | 5.4 | 67.90 | 88.3 | 65.44 | 186.1 | 65.19 | 94.1 | 68.22 | 82.4 | 68.18 | 82.3 | 69.94 | 79.0 |
| CH-Abd | 33.46 | 3.6 | 35.08 | 242.2 | 36.42 | 254.6 | 35.29 | 217.4 | 46.88 | 220.5 | 46.77 | 218.5 | 50.75 | 252.7 |
| CH-Ded | 20.67 | 3.9 | 42.83 | 280.1 | 41.54 | 258.1 | 36.29 | 262.4 | 41.62 | 258.6 | 41.43 | 257.0 | 42.25 | 251.6 |
| FOLIO | 43.27 | 3.8 | 44.77 | 298.2 | 48.59 | 334.4 | 48.17 | 329.2 | 57.56 | 324.1 | 57.56 | 322.2 | 55.32 | 260.1 |
| GPQA | 33.93 | 4.0 | 25.67 | 530.0 | 24.78 | 767.0 | 29.46 | 327.5 | 38.84 | 358.8 | 38.84 | 354.2 | 42.06 | 498.6 |
| GSM8K | 8.42 | 6.0 | 75.66 | 227.2 | 74.07 | 231.5 | 58.61 | 233.8 | 72.40 | 251.9 | 72.37 | 251.5 | 75.89 | 234.4 |
| LSAT | 41.13 | 4.0 | 41.13 | 303.3 | 39.25 | 420.5 | 41.63 | 293.4 | 46.98 | 287.4 | 46.98 | 285.6 | 48.36 | 316.9 |
| MultiArith | 24.50 | 5.6 | 96.83 | 129.6 | 97.67 | 115.7 | 85.00 | 160.8 | 94.50 | 159.4 | 94.50 | 159.4 | 96.17 | 135.4 |
| MuSR | 51.32 | 4.1 | 48.68 | 154.9 | 49.74 | 178.1 | 51.46 | 91.9 | 54.63 | 92.3 | 54.56 | 91.6 | 57.14 | 130.6 |
| PIQA | 75.52 | 4.8 | 71.49 | 130.4 | 72.52 | 191.6 | 75.46 | 99.9 | 77.04 | 97.4 | 77.04 | 97.1 | 77.31 | 103.2 |
| SIQA | 63.15 | 4.9 | 64.84 | 76.0 | 63.36 | 191.1 | 63.20 | 72.6 | 63.77 | 66.8 | 63.76 | 66.8 | 67.09 | 91.9 |
| StratQA | 81.40 | 4.3 | 61.44 | 121.9 | 70.52 | 225.9 | 77.16 | 139.6 | 82.49 | 118.4 | 82.49 | 117.8 | 82.93 | 90.7 |
| Overall | 52.75 | 4.4 | 59.92 | 194.4 | 60.84 | 251.7 | 59.63 | 184.9 | 65.24 | 179.4 | 64.70 | 186.9 | 66.64 | 178.9 |

Table 4: Ablation study on EDRM-Inst-E across 4 models. Removing the fallback compensation or individual entropy features consistently degrades performance.

| Model | Full (EDRM-Inst-E) Acc (%) | Full (EDRM-Inst-E) Token | w/o Fallback Acc (%) | w/o Fallback Token | w/o $S_H$ Acc (%) | w/o $S_H$ Token | w/o $a_{\text{vnr}}$ Acc (%) | w/o $a_{\text{vnr}}$ Token |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B | 65.24 | 179.4 | 59.98 | 177.1 | 59.88 | 185.9 | 58.94 | 181.5 |
| Llama-3.1-8B | 71.97 | 153.9 | 67.30 | 149.8 | 67.38 | 146.9 | 65.94 | 147.3 |
| Qwen2.5-7B | 78.11 | 242.7 | 73.36 | 240.7 | 73.07 | 249.1 | 72.66 | 246.0 |
| Qwen3-4B-T | 80.29 | 401.1 | 78.42 | 397.9 | 79.21 | 440.8 | 77.65 | 401.2 |

#### 4.2.2 Instance-Level Routing Performance

To mitigate sampling sensitivity on multi-step benchmarks (GSM8K, BBH), we evaluate instance-level routing, which dynamically adapts to per-sample entropy trajectories for fine-grained, on-demand reasoning. Results in Tables [3](https://arxiv.org/html/2605.22873#S4.T3) and [1](https://arxiv.org/html/2605.22873#S4.T1) yield three key insights.

##### Continuous & Consistent performance gains\.

Compared to coarse-grained EDRM-Global, instance-level routing yields consistent gains on all models (Table [1](https://arxiv.org/html/2605.22873#S4.T1)). On Llama-3.2-3B-Instruct, EDRM-Inst-E increases accuracy from 61.89% (Global) to 65.24% (+3.35%) while maintaining 28.7% token savings versus CoT (Table [3](https://arxiv.org/html/2605.22873#S4.T3)). Similar improvements hold for larger models: +3.49% on Llama-3.1-8B and +4.10% on Qwen2.5-7B. This confirms that per-sample discrimination of boundary cases is essential for unlocking the full potential of entropy dynamics-based routing.

##### Efficiency spectrum across 3 variants\.

The 3 instance-level variants form a tunable performance–efficiency spectrum (Table [3](https://arxiv.org/html/2605.22873#S4.T3)): (1) EDRM-Inst-E achieves 65.24% accuracy with 179.4 tokens; (2) EDRM-Inst-C guarantees deployment stability (variance $<1.4\times 10^{-6}$ on 8 seeds); and (3) EDRM-MLP achieves peak accuracy (66.64%, +1.40% over Inst-E). Crucially, the MLP's learned decision boundaries exhibit strong alignment with the hand-crafted heuristic thresholds. This convergence provides direct empirical validation that our entropy descriptors $(S_{H}, V_{\text{sp}}, a_{\text{vnr}})$ capture informative signals, confirming the training-free heuristic as a theoretically grounded proxy.

##### Task\-aware dynamic allocation validates routing rationale\.

Instance-level routing demonstrates precise compute-budget allocation: it dynamically triggers CoT on high-reasoning benchmarks (CH-Abd, BBH), boosting accuracy significantly (e.g., BBH: 59.35% vs. Global 53.55%), while suppressing redundant generation on light-reasoning tasks (StratQA), compressing tokens from 225.9 (CoT) to 90.7 (EDRM-MLP). This "reason only when needed" mechanism constitutes the core advantage over fixed global policies and establishes a strong foundation for cross-model generalization.

#### 4.2.3 Generalization across Diverse Settings

Instance-level EDRM variants outperform Global routing in accuracy across all four LLMs and surpass every static baseline on the three base models (Table [1](https://arxiv.org/html/2605.22873#S4.T1)), with accuracy gains of +3.5% to +4.7% over CoT on base models while reducing token consumption by 27–45%. This confirms that entropy dynamics-based routing signals generalize beyond architecture-specific behaviors.

##### Robustness under adversarial circumstances\.

We deliberately include the reasoning-enhanced Qwen3-4B-T as a stress case: it exhibits a strong over-reasoning bias under static CoT (642.5 tokens), and EDRM effectively suppresses this redundant deliberation. EDRM-Inst-E reduces token cost by 37.6% (401.1 tokens) while achieving 80.29% Acc (vs. 81.35% for CoT); EDRM-MLP achieves 81.20% accuracy with 33.9% token savings. This demonstrates that entropy dynamics provide reliable routing signals even when models are explicitly biased toward verbose reasoning.

### 4.3 Ablation and Analysis

In this section we present an ablation of the core variant EDRM-Inst-E, which relies on two key design elements: (A) synergistic routing via the 3D entropy dynamics $(S_{H}, V_{\text{sp}}, a_{\text{vnr}})$, and (B) fallback compensation for decision robustness. We construct 3 ablated variants by removing critical components: (1) w/o Fallback: remove the Direct fallback branch; (2) w/o $S_{H}$: remove the cumulative-entropy threshold and use only the coupling of $V_{\text{sp}}$ and $a_{\text{vnr}}$; (3) w/o $a_{\text{vnr}}$: further remove volatility and route using only the univariate trend $V_{\text{sp}}$ (degenerating to a Token-Signature-like baseline). Results in Table [4](https://arxiv.org/html/2605.22873#S4.T4) reveal the distinct contributions and synergistic effects of the two design elements:

##### Fallback compensation prevents catastrophic failures\.

Removing the fallback branch (w/o Fallback) reduces accuracy by 3.5–4.8% across all models while narrowing token savings. This indicates that even when routing favors Standard/CoT, early probing bias or mid-generation drift can still cause failures. The parallel Direct branch acts as a low-cost "safety net" that intercepts such failures, preserving the high-accuracy–low-overhead balance central to EDRM's design.

##### 3D feature synergy optimizes the efficiency frontier\.

Feature ablation reveals non-redundancy and complementarity among the three descriptors. Removing $S_{H}$ (w/o $S_{H}$) degrades accuracy and *increases* token consumption (e.g., +43.8 tokens on Qwen3-4B-T), confirming that $S_{H}$ effectively identifies "early uncertainty overload" samples and routes them to Direct to avoid futile reasoning expansion. Further removing $a_{\text{vnr}}$ (w/o $a_{\text{vnr}}$) causes additional accuracy drops (1.0–2.6%), demonstrating that the univariate trend $V_{\text{sp}}$ is susceptible to local noise, while the volatility $a_{\text{vnr}}$ provides a critical stability prior that distinguishes genuine convergence from spurious oscillations. The low-dimensional manifold $(S_{H}, V_{\text{sp}}, a_{\text{vnr}})$ thus achieves a Pareto-optimal trade-off between discriminative power and information redundancy; removing any dimension disrupts this balance.

## 5 Conclusion

This work revisits LLM reasoning from a dynamical systems perspective and establishes that the utility of explicit reasoning emerges from decoding-time entropy dynamics rather than being a fixed property of models or tasks. Our systematic analysis reveals that successful reasoning corresponds to a phase-transition-like shift from high-entropy exploratory regimes to low-entropy structured convergence, while ineffective reasoning remains trapped in oscillatory or divergent dynamics. These distinct patterns naturally organize into separable regions in a compact three-dimensional entropy manifold $(S_{H}, V_{\text{sp}}, a_{\text{vnr}})$, providing both theoretical insight and a practical signal for adaptive control. Based on this foundation, we introduce EDRM, a lightweight and training-free framework that embeds early-stage entropy dynamics into a reasoning manifold for instance-adaptive inference routing. Extensive experiments across 15 benchmarks and 4 LLMs demonstrate EDRM's effectiveness: at the dataset level, it achieves 41–55% token reduction while improving accuracy with minimal calibration; at the instance level, it further improves accuracy by up to 4.7% while maintaining 27–45% token savings. Our results suggest that reasoning is better understood as a controllable decoding state that should be invoked selectively based on real-time generation dynamics rather than static task categories. This perspective opens new avenues for efficient LLM inference, particularly in resource-constrained and latency-sensitive applications.

##### Limitations and Future Work\.

EDRM still requires a short probing stage, introducing moderate overhead compared to pure direct decoding\. Our current experiments are also limited to open\-source text\-based models in the 3B–8B range; extending the framework to larger\-scale, API\-only, and multimodal systems remains important future work\. In addition, we view EDRM as an initial step toward adaptive reasoning control in autonomous agents\. Integrating entropy\-dynamic routing into production\-scale agent frameworks such as OpenCLAW may further improve efficiency and robustness in long\-horizon multi\-turn reasoning\.

## References

- PIQA: reasoning about physical commonsense in natural language\.InAAAI Conference on Artificial Intelligence,External Links:[Link](https://api.semanticscholar.org/CorpusID:208290939)Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- W\. Chen, L\. Peng, T\. Tan, C\. Zhao, B\. J\. Chen, Z\. Lin, A\. Go, and Y\. Meng \(2026\)Think deep, not just long: measuring llm reasoning effort via deep\-thinking tokens\.arXiv preprint arXiv:2602\.13517\.External Links:[Link](https://api.semanticscholar.org/CorpusID:285616013)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv preprint arXiv:1803\.05457\.Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.ArXivabs/2110\.14168\.External Links:[Link](https://api.semanticscholar.org/CorpusID:239998651)Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- A\. Conmy, A\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso \(2023\)Towards automated circuit discovery for mechanistic interpretability\.Advances in Neural Information Processing Systems36,pp\. 16318–16352\.Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Ding, A\. Mallick, S\. Zhang, C\. Wang, D\. Madrigal, M\. del Carmen Hipolito Garcia, M\. Xia, L\. V\. S\. Lakshmanan, Q\. Wu, and V\. Rühle \(2025\)Best\-route: adaptive llm routing with test\-time optimal compute\.arXiv preprint arXiv:2506\.22716\.External Links:[Link](https://api.semanticscholar.org/CorpusID:280011744)Cited by:[§1](https://arxiv.org/html/2605.22873#S1.p2.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630\(8017\),pp\. 625 – 630\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270615909)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Fartale, A\. Kattamuri, R\. Raja, A\. Vats, I\. Prasad, and A\. K\. Moharir \(2025\)Disentangling recall and reasoning in transformer models through layer\-wise attention and activation analysis\.arXiv preprint arXiv:2510\.03366\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281843350)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Fein\-Ashley, D\. Parikh, R\. Kannan, and V\. K\. Prasanna \(2025\)Mixture of thoughts: learning to aggregate what experts think, not just what they say\.arXiv preprint arXiv:2509\.21164\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281525929)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Fernandez, B\. Kveton, R\. A\. Rossi, A\. S\. Lan, and J\. Z\. Wang \(2025\)RADAR: reasoning\-ability and difficulty\-aware routing for reasoning llms\.arXiv preprint arXiv:2509\.25426\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281681703)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Gan, Y\. Liao, and Y\. Liu \(2025\)Rethinking external slow\-thinking: from snowball errors to probability of correct reasoning\.arXiv preprint arXiv:2501\.15602\.External Links:[Link](https://api.semanticscholar.org/CorpusID:275921243)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021\)Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies\.Transactions of the Association for Computational Linguistics9,pp\. 346–361\.External Links:[Link](https://api.semanticscholar.org/CorpusID:230799347)Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[1st item](https://arxiv.org/html/2605.22873#A2.I5.i1.p1.1),[2nd item](https://arxiv.org/html/2605.22873#A2.I5.i2.p1.1),[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p3.1)\.
- S\. Han, H\. Schoelkopf, Y\. Zhao, Z\. Qi, M\. Riddell, W\. Zhou, J\. Coady, D\. Peng, Y\. Qiao, L\. Benson,et al\.\(2024\)Folio: natural language reasoning with first\-order logic\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 22017–22031\.Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- J\. He, M\. Liu, O\. P\. Olaleye, A\. Agarwal, M\. Avendi, Y\. Abbasi, M\. Rowe, H\. L\. Patel, P\. Li, T\. Sheng,et al\.\(2026\)Think twice before you write–an entropy\-based decoding strategy to enhance llm reasoning\.arXiv preprint arXiv:2604\.00018\.Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Hua, K\. Zhu, L\. Li, L\. Fan, M\. Jin, S\. Lin, H\. Xue, Z\. Li, J\. Wang, and Y\. Zhang \(2025\)Disentangling logic: the role of context in large language model reasoning capabilities\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 19219–19242\.Cited by:[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1)\.
- J\. Huang, Z\. Zhang, K\. Shi, Y\. Ye, and C\. Zhang \(2026\)EvolveRouter: co\-evolving routing and prompt for multi\-agent question answering\.arXiv preprint arXiv:2604\.05149\.External Links:[Link](https://api.semanticscholar.org/CorpusID:287208061)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Hui, J\. Yang, Z\. Cui, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu, K\. Lu,et al\.\(2024\)Qwen2\. 5\-coder technical report\.arXiv preprint arXiv:2409\.12186\.Cited by:[3rd item](https://arxiv.org/html/2605.22873#A2.I5.i3.p1.1),[§4\.1](https://arxiv.org/html/2605.22873#S4.SS1.p3.1)\.
- L\. Jiang, X\. Wu, S\. Huang, Q\. Dong, Z\. Chi, L\. Dong, X\. Zhang, T\. Lv, L\. Cui, and F\. Wei \(2025\)Think only when you need with large hybrid\-reasoning models\.arXiv preprint arXiv:2505\.14631\.External Links:[Link](https://api.semanticscholar.org/CorpusID:278768506)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Jin, J\. W\. Yeom, S\. Bae, and T\. Kim \(2025\)“Well, keep thinking”: enhancing llm reasoning with adaptive injection decoding\.InAssociation for Computational Linguistics,pp\. 9989–10018\.Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Kim, M\. Valentino, and A\. Freitas \(2024\)Reasoning circuits in language models: a mechanistic interpretation of syllogistic inference\.InAssociation for Computational Linguistics,pp\. 10074–10095\.External Links:[Link](https://api.semanticscholar.org/CorpusID:271892176)Cited by:[§1](https://arxiv.org/html/2605.22873#S1.p2.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Li, H\. Liu, D\. Zhou, and T\. Ma \(2024\)Chain of thought empowers transformers to solve inherently serial problems\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.22873#S1.p2.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Liu, B\. Zou, K\. Chen, J\. Liu, W\. Wang, and H\. Li \(2026\)Task\-aware llm routing with multi\-level task\-profile\-guided data synthesis for cold\-start scenarios\.External Links:[Link](https://api.semanticscholar.org/CorpusID:287352177)Cited by:[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Liu, F\. Xu, and Y\. Li \(2025\)Token signature: predicting chain\-of\-thought gains with token decoding feature in large language models\.ArXivabs/2506\.06008\.External Links:[Link](https://api.semanticscholar.org/CorpusID:279243226)Cited by:[§1](https://arxiv.org/html/2605.22873#S1.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Liu, J\. Geng, A\. J\. Wu, I\. Sucholutsky, T\. Lombrozo, and T\. L\. Griffiths \(2024\)Mind your step \(by step\): chain\-of\-thought can reduce performance on tasks where thinking makes humans worse\.arXiv preprint arXiv:2410\.21333\.External Links:[Link](https://api.semanticscholar.org/CorpusID:273662487)Cited by:[§1](https://arxiv.org/html/2605.22873#S1.p1.1),[§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1)\.
- S. Qi, J. Ma, Z. Yin, L. Zhang, J. Zhang, J. Liu, F. Tian, and T. Liu (2025) Plan before solving: problem-aware strategy routing for mathematical reasoning with llms. arXiv preprint arXiv:2509.24377. External Links: [Link](https://api.semanticscholar.org/CorpusID:281675988) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) Gpqa: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022. Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- S. Roy and D. Roth (2016) Solving general arithmetic word problems. ArXiv abs/1608.01413. External Links: [Link](https://api.semanticscholar.org/CorpusID:560565) Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social iqa: commonsense reasoning about social interactions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 4463–4473. Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- C. Shao, X. Hu, Y. Lin, and F. Xu (2025a) Division-of-thoughts: harnessing hybrid language model synergy for efficient on-device agents. Proceedings of the ACM on Web Conference 2025, pp. 1822–1833. External Links: [Link](https://api.semanticscholar.org/CorpusID:276235473) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1).
- C. Shao, X. Liu, Y. Lin, F. Xu, and Y. Li (2025b) Route-and-reason: scaling large language model reasoning with reinforced model router. arXiv preprint arXiv:2506.05901. External Links: [Link](https://api.semanticscholar.org/CorpusID:279245092) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1).
- A. Sharma and P. Chopra (2025) Think just enough: sequence-level entropy as a confidence signal for llm reasoning. arXiv preprint arXiv:2510.08146. External Links: [Link](https://api.semanticscholar.org/CorpusID:281950492) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. External Links: [Link](https://api.semanticscholar.org/CorpusID:271719990) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).
- R. Speer, J. Chin, and C. Havasi (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence. External Links: [Link](https://api.semanticscholar.org/CorpusID:15206880) Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- Z. Sprague, X. Ye, K. Bostrom, S. Chaudhuri, and G. Durrett (2024a) Musr: testing the limits of chain-of-thought with multistep soft reasoning. In International Conference on Learning Representations, Vol. 2024, pp. 14670–14728. Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- Z. Sprague, F. Yin, J. D. Rodriguez, D. Jiang, M. Wadhwa, P. Singhal, X. Zhao, X. Ye, K. Mahowald, and G. Durrett (2024b) To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning. arXiv preprint arXiv:2409.12183. External Links: [Link](https://api.semanticscholar.org/CorpusID:272708032) Cited by: [§1](https://arxiv.org/html/2605.22873#S1.p2.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px2.p1.1).
- A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. (2022) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. ArXiv abs/2206.04615. External Links: [Link](https://api.semanticscholar.org/CorpusID:263625818) Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- T. Su, M. Zhang, and G. He (2025) Entropy-aware speculative decoding toward improved llm reasoning. arXiv preprint arXiv:2512.23765. External Links: [Link](https://api.semanticscholar.org/CorpusID:284350835) Cited by: [§1](https://arxiv.org/html/2605.22873#S1.p2.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).
- Y. Sui, Y. He, T. Cao, S. Han, and B. Hooi (2025) Meta-reasoner: dynamic guidance for optimized inference-time reasoning in large language models. arXiv preprint arXiv:2502.19918. External Links: [Link](https://api.semanticscholar.org/CorpusID:276647786) Cited by: [§1](https://arxiv.org/html/2605.22873#S1.p2.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).
- X. Wang and D. Zhou (2024) Chain-of-thought reasoning without prompting. Advances in Neural Information Processing Systems 37, pp. 66383–66409. Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [4th item](https://arxiv.org/html/2605.22873#A2.I5.i4.p1.1), [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p3.1).
- X. Zhao (2026) Entropy trajectory shape predicts llm reasoning reliability: a diagnostic study of uncertainty dynamics in chain-of-thought. External Links: [Link](https://api.semanticscholar.org/CorpusID:286673669) Cited by: [§1](https://arxiv.org/html/2605.22873#S1.p2.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).
- W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan (2023) AGIEval: a human-centric benchmark for evaluating foundation models. ArXiv abs/2304.06364. External Links: [Link](https://api.semanticscholar.org/CorpusID:258108259) Cited by: [§4.1](https://arxiv.org/html/2605.22873#S4.SS1.p2.1).
- C. Zhu, S. Wu, X. Zeng, Z. Xu, Z. Kang, Y. Guo, Y. Lu, J. Huang, and G. Zhou (2026) EDIS: diagnosing llm reasoning via entropy dynamics. arXiv preprint arXiv:2602.01288. External Links: [Link](https://api.semanticscholar.org/CorpusID:285269294) Cited by: [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.22873#S2.SS0.SSS0.Px3.p1.1).

## Appendix A: Methodological Details

### A.1 Entropy Dynamics Descriptors

This section provides detailed definitions and computational procedures for the three entropy dynamics descriptors used in EDRM: cumulative entropy $S_{H}$, univariate trend $V_{\text{sp}}$, and volatility $a_{\text{vnr}}$. We also discuss the rationale for selecting these descriptors over other candidate metrics, emphasizing their theoretical grounding and empirical effectiveness in capturing the dynamics of LLM reasoning. Inspired by both the $S$–$V$–$a$ (position–velocity–acceleration) description in kinematics and time series analysis, we design the three descriptors to capture distinct yet complementary aspects of the entropy trajectory.

##### Input to the descriptor extractor\.

For each instance $x$, we run an $N$-step Standard probing decode and record the token-level entropy sequence $E_{x}=\{H_{1}(x),\dots,H_{T}(x)\}$, where $T\leq N$ is the number of probing tokens generated before EOS. In our instance-level heuristic routing, if $T<N$ (early termination), we treat the instance as “already confident” and route it to Standard directly (i.e., descriptors are not required in that case). Otherwise, we compute the three descriptors on $E_{x}$ with $T=N$. The full procedure is given in Algorithm [2](https://arxiv.org/html/2605.22873#alg2).

Algorithm 2: Extracting entropy dynamics descriptors $(S_{H},V_{\text{sp}},a_{\text{vnr}})$

1: Input: probing entropy sequence $E=\{H_{1},\dots,H_{T}\}$, max length $N$, small constant $\epsilon$

2: Output: descriptors $(S_{H},V_{\text{sp}},a_{\text{vnr}})$

3: if $T<N$ then

4: return EarlyStop $\triangleright$ route to Standard; skip descriptor-based routing

5: end if

6: $S_{H}\leftarrow\sum_{i=1}^{N}H_{i}$

7: $V_{\text{sp}}\leftarrow\mathrm{Spearman}(\{1,\dots,N\},\{H_{1},\dots,H_{N}\})$

8: $\bar{H}\leftarrow\frac{1}{N}\sum_{i=1}^{N}H_{i}$

9: $\mathrm{Var}\leftarrow\frac{1}{N}\sum_{i=1}^{N}(H_{i}-\bar{H})^{2}$

10: $\mathrm{MSD}\leftarrow\frac{1}{N-1}\sum_{i=1}^{N-1}(H_{i+1}-H_{i})^{2}$

11: $a_{\text{vnr}}\leftarrow\frac{\mathrm{MSD}}{\mathrm{Var}+\epsilon}$

12: return $(S_{H},V_{\text{sp}},a_{\text{vnr}})$
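The steps of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the sketch assumes $N\geq 2$, and returning a correlation of 0.0 for a constant sequence is an assumption made here (the source does not specify that case).

```python
from typing import Optional, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> list:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    norm_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption (not from the paper): constant input -> 0.0
    return cov / (norm_a * norm_b)


def extract_descriptors(
    entropies: Sequence[float], max_len: int, eps: float = 1e-8
) -> Optional[Tuple[float, float, float]]:
    """Return (S_H, V_sp, a_vnr), or None for the EarlyStop branch (T < N)."""
    if len(entropies) < max_len:
        return None  # early termination: route to Standard
    h = list(entropies[:max_len])
    n = len(h)
    s_h = sum(h)                                             # cumulative entropy
    v_sp = pearson(list(range(1, n + 1)), average_ranks(h))  # Spearman via ranks
    mean_h = s_h / n
    var = sum((x - mean_h) ** 2 for x in h) / n              # population variance
    msd = sum((h[i + 1] - h[i]) ** 2 for i in range(n - 1)) / (n - 1)
    a_vnr = msd / (var + eps)
    return s_h, v_sp, a_vnr
```

For example, `extract_descriptors([4.0, 3.0, 2.0, 1.0], max_len=4)` describes a steadily decreasing trajectory, while a sequence shorter than `max_len` returns `None`, mirroring the EarlyStop branch.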

#### A.1.1 Cumulative Entropy $S_{H}$

Analogous to position in kinematics, $S_{H}$ is defined as the cumulative sum of token-level entropy over the $N$-step probing sequence, which captures the overall accumulation of uncertainty during the early decoding phase. Formally, for a given input instance $x$, let $p_{i}(\cdot)$ be the next-token distribution at probing step $i$ and $H_{i}(x)$ the corresponding token-level entropy. We compute

$$S_{H}(x)=\sum_{i=1}^{N}H_{i}(x),\qquad H_{i}(x)=-\sum_{v\in\mathcal{V}}p_{i}(v)\log p_{i}(v). \tag{6}$$

Here $\mathcal{V}$ is the vocabulary and $H_{i}(x)$ is the Shannon entropy of the next-token distribution at step $i$.
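As an illustration of Eq. (6), the sketch below computes the per-step entropy from raw logits and sums it over the probing window. It assumes the natural logarithm (entropy in nats), since the source does not state the base, and the function names are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    peak = max(logits)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0.0:  # treat 0 * log 0 as 0
            entropy -= p * math.log(p)
    return entropy


def cumulative_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """S_H: sum of per-step entropies over the probing window."""
    return sum(token_entropy(logits) for logits in step_logits)
```

A uniform distribution over two tokens gives $\log 2$, and a sharply peaked distribution gives a value close to zero, so $S_{H}$ grows with both the number of uncertain steps and how uncertain each one is.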

##### Interpretation\.

$S_{H}$ measures the *total uncertainty budget* the model spends during early decoding. A high $S_{H}$ indicates sustained uncertainty across multiple steps (uncertainty overload), where extended reasoning (e.g., CoT) is less likely to converge reliably and may amplify drift; a low-to-moderate $S_{H}$ suggests the model is already operating in a relatively confident regime where lightweight reasoning can be beneficial.

##### Why $S_{H}$ instead of other cumulative summaries?

Alternative cumulative metrics include (i) average entropy $\bar{H}=\frac{1}{N}\sum_{i}H_{i}$, (ii) maximum entropy $\max_{i}H_{i}$, and (iii) endpoint difference $\Delta H=H_{1}-H_{N}$. We prefer $S_{H}$ because:

- $\bar{H}$ is scale-dependent and can be misleading when $N$ varies or when the trajectory has high volatility;
- $\max_{i}H_{i}$ over-emphasizes a single step and discards global information about the trajectory;
- $\Delta H$ ignores mid-trajectory oscillations and is sensitive to noise at the first/last steps.

In contrast, $S_{H}$ aggregates information across all steps and provides a stable, scale-consistent indicator of cumulative uncertainty.

#### A.1.2 Univariate Trend $V_{\text{sp}}$

##### Definition\.

The trend descriptor $V_{\text{sp}}$ quantifies whether the entropy trajectory tends to decrease (convergent) or increase (divergent) over time. We define it as the Spearman rank correlation between the step index and the entropy values:

$$V_{\text{sp}}(x)=\mathrm{Spearman}(\{1,\dots,N\},\{H_{1}(x),\dots,H_{N}(x)\}). \tag{7}$$

Operationally, let $r_{i}$ be the rank of $H_{i}$ among $\{H_{1},\dots,H_{N}\}$ (ties are assigned average ranks). Then

$$V_{\text{sp}}=\mathrm{Corr}\left(\{1,\dots,N\},\{r_{1},\dots,r_{N}\}\right), \tag{8}$$

where $\mathrm{Corr}(\cdot,\cdot)$ is the Pearson correlation computed on the rank-transformed sequence. By construction, $V_{\text{sp}}\in[-1,1]$: negative values indicate a decreasing entropy trend (progressive uncertainty reduction), while positive values indicate increasing uncertainty and higher drift risk.

##### Interpretation\.

$V_{\text{sp}}$ serves as the “velocity” signal in our kinematic analogy: it captures *directional evolution* rather than magnitude. Intuitively, tasks that benefit from deliberate reasoning typically show a sustained reduction in uncertainty during early decoding, corresponding to $V_{\text{sp}}<0$; tasks with weak or negative CoT gains often exhibit oscillation or upward drift, reflected by $V_{\text{sp}}\geq 0$.

##### Why Spearman instead of common trend metrics?

We choose Spearman correlation over several alternatives:

- Linear regression slope $\hat{\beta}$ of $H_{i}$ on $i$: while it provides a signed trend, it is *scale-dependent* and sensitive to outliers (single entropy spikes can dominate $\hat{\beta}$). Moreover, entropy trajectories are frequently non-linear (piecewise or curved), making a slope a fragile summary.
- Pearson correlation $\mathrm{Corr}(i,H_{i})$: it also assumes linear association and is sensitive to magnitude outliers; it can overreact to a few extreme points.
- Average increment $\frac{1}{N-1}\sum_{i=2}^{N}(H_{i}-H_{i-1})$: this sum telescopes to $(H_{N}-H_{1})/(N-1)$, so it depends only on the two endpoints and is highly sensitive to local noise at those steps, which is common in token-level entropy.

In contrast, $V_{\text{sp}}$ is *non-parametric* (depends only on ordering), robust to non-linear monotone patterns, and less sensitive to occasional spikes/dips. This makes it well-suited for noisy entropy trajectories produced by early decoding.
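The robustness to non-linear monotone patterns can be seen on a small toy trajectory. The sketch below is illustrative only: the entropy values are hypothetical, the helper names are not from the paper, and the simple ranking helper assumes all values are distinct.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    norm_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (norm_a * norm_b)


def ranks_no_ties(values):
    """1-based ranks; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


steps = list(range(1, 9))
# Hypothetical trajectory: a sharp initial drop, then a slow monotone decline.
entropy = [6.0, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6]

pearson_r = pearson(steps, entropy)                  # distorted by the first point
spearman_r = pearson(steps, ranks_no_ties(entropy))  # depends only on ordering

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the trajectory is strictly decreasing, the rank-based correlation reaches its minimum of $-1$, whereas the Pearson correlation is noticeably weaker because the single large first value breaks the linear fit.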

#### A.1.3 Volatility $a_{\text{vnr}}$ (Von Neumann Ratio)

##### Definition\.

We instantiate the volatility descriptor using the *Von Neumann ratio* (VNR), defined as the ratio between the mean square successive difference (MSD) and the (population) variance of the sequence:

$$a_{\text{vnr}}(x)=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\big(H_{i+1}(x)-H_{i}(x)\big)^{2}}{\frac{1}{N}\sum_{i=1}^{N}\big(H_{i}(x)-\bar{H}(x)\big)^{2}+\epsilon}, \tag{9}$$

where $\bar{H}(x)=\frac{1}{N}\sum_{i=1}^{N}H_{i}(x)$ and $\epsilon$ is a small constant (e.g., $10^{-8}$). If the variance is zero, we set $a_{\text{vnr}}(x)=0$.
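To make the scale of Eq. (9) concrete, the worked arithmetic below evaluates the ratio on two toy sequences of length $N=4$. Both sequences are hypothetical, and $\epsilon$ is dropped because it is negligible here.

```latex
\begin{align*}
% Steady decline: H = (4, 3, 2, 1), mean 2.5
\mathrm{MSD} &= \tfrac{1}{3}\,(1 + 1 + 1) = 1, &
\mathrm{Var} &= \tfrac{1}{4}\,(2.25 + 0.25 + 0.25 + 2.25) = 1.25, &
a_{\text{vnr}} &= 1 / 1.25 = 0.8, \\
% Alternation: H = (1, 2, 1, 2), mean 1.5
\mathrm{MSD} &= \tfrac{1}{3}\,(1 + 1 + 1) = 1, &
\mathrm{Var} &= \tfrac{1}{4}\,(0.25 + 0.25 + 0.25 + 0.25) = 0.25, &
a_{\text{vnr}} &= 1 / 0.25 = 4.
\end{align*}
```

Both sequences have the same step-to-step variation; only the global spread differs, so the ratio is small for a steady trend and large for a pure oscillation.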

##### Interpretation\.

$a_{\text{vnr}}$ compares *local* step-to-step variation (the mean square successive difference) against the *global* dispersion of the trajectory (the variance), as in Eq. (9). It acts as a normalized “acceleration” signal: when the entropy curve oscillates sharply (bursty uncertainty), successive differences are large relative to the overall spread, increasing $a_{\text{vnr}}$; when the curve is smooth and stable, consecutive values stay close together and $a_{\text{vnr}}$ is lower. In routing, $a_{\text{vnr}}$ plays a critical role as a stability prior: trend estimates (e.g., $V_{\text{sp}}$) are less reliable under high volatility, so we couple them via the ratio $V_{\text{sp}}/a_{\text{vnr}}$ in the heuristic rule.

##### Why this volatility form instead of standard volatility metrics?

We prefer $a_{\text{vnr}}$ over several common choices:

- Standard deviation $\sigma$: captures global dispersion but is scale-dependent and can be inflated by a few extreme points; used alone, it cannot distinguish “smoothly high” entropy from “highly oscillatory” entropy.
- Mean absolute increment $\frac{1}{N-1}\sum_{i=2}^{N}|H_{i}-H_{i-1}|$: focuses on local changes, but it is also sensitive to deterministic trends (a steady monotone decrease can still yield large increments), thus conflating trend with volatility.
- Mean absolute deviation (MAD) $\frac{1}{N}\sum_{i}|H_{i}-\bar{H}|$: more robust than $\sigma$ but still scale-dependent, and it does not explicitly emphasize bursty oscillations.

The ratio $\mathrm{MSD}/\mathrm{Var}$ in Eq. (9) is (i) *scale-normalized* (invariant to multiplying the entire entropy sequence by a constant, up to the small $\epsilon$ term), (ii) sensitive to spiky/oscillatory deviations, and (iii) empirically effective as a noise-aware normalization term when combined with $V_{\text{sp}}$. This design matches our practical objective: separating genuine convergence (stable downward trend) from spurious monotonicity induced by noise.

### A.2 Comparison of CoT and Direct Decoding across All Benchmarks

This section provides a comprehensive comparison of CoT and Direct decoding strategies across all evaluated benchmarks and models. As shown in Figure [4](https://arxiv.org/html/2605.22873#A1.F4), we analyze the performance gain brought by Chain-of-Thought reasoning. The results show distinct trends that depend on task type and model capability.

##### Mathematical Reasoning.

For math-intensive benchmarks such as gsm8k and MArith, CoT decoding yields large performance improvements across all models. The gains are large enough to reach the upper limit of our visualization scale, indicating that step-by-step reasoning is essential for these arithmetic problems and far outperforms direct decoding.

##### Logical Deduction and Complex Tasks.

In logical reasoning tasks like CH-ded, bbh, and CH-abd, CoT generally provides a positive boost, though the magnitude varies by model. Notably, the Qwen3-4B model (red bars) achieves the highest CoT gains in several of these categories, often surpassing larger models like Llama-3.1-8B and Qwen2.5-7B. This suggests that Qwen3-4B is particularly effective at leveraging reasoning paths for deduction.

##### General Knowledge and Common Sense.

For benchmarks involving common sense or general knowledge (e.g., siqa, arc-e, csqa), the advantages of CoT are much more marginal. Most models show only slight improvements or negligible differences compared to direct decoding. In some cases, such as lsat, gpqa, and stqa, several models (particularly the Llama series and Qwen2.5) exhibit negative gains, suggesting that for these specific tasks, generating a reasoning trace may introduce noise or errors, making direct decoding the superior strategy.

##### Model Comparison.

A key observation is the consistent performance of Qwen3-4B. While other models fluctuate between positive and negative gains depending on the benchmark, Qwen3-4B maintains positive CoT gains across almost all tasks, including those where other models struggle (e.g., gpqa). This indicates a strong generalization capability in utilizing Chain-of-Thought reasoning across diverse domains.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/cot_gain_appendix.png)
Figure 4: Comprehensive comparison of CoT and Direct decoding strategies across all evaluated benchmarks and models. Each subplot corresponds to a specific model, with bars representing the accuracy of CoT and Direct on each benchmark. The performance differences are analyzed to identify patterns of CoT gains or losses across different tasks and model sizes.

### A.3 Visualization of Entropy Trajectories of All Models and Benchmarks

As shown in Figure[5](https://arxiv.org/html/2605.22873#A1.F5), the entropy trajectories reveal distinct uncertainty dynamics across models\. Qwen3\-4B \(red line\) consistently maintains the lowest entropy levels across almost all benchmarks, exhibiting a stable trajectory with minimal fluctuation\. This indicates high confidence in token generation\. Conversely, Llama\-3B \(blue\) and Llama\-8B \(green\) generally display higher entropy with significant volatility, particularly in the early decoding steps\. Qwen2\.5\-7B \(orange\) typically falls between these extremes\.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/trend_all.jpg)
Figure 5: Entropy trajectories for all evaluated models and benchmarks. Each subplot corresponds to a specific benchmark, with lines of different colors representing the average entropy trajectory of each model.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/heatbase4.jpg)
Figure 6: Unified Gain Heatmap with various $\lambda$ (0.03, 0.05, 0.07, 0.10) on base models.

![Refer to caption](https://arxiv.org/html/2605.22873v1/fig/heatThink4.jpg)
Figure 7: Unified Gain Heatmap with various $\lambda$ (0.03, 0.05, 0.07, 0.10) on think models.

### A.4 Unified Gain Heatmap

Figures [6](https://arxiv.org/html/2605.22873#A1.F6) and [7](https://arxiv.org/html/2605.22873#A1.F7) visualize how the relative utility of CoT versus Direct varies across regions of the entropy-dynamics manifold. Since accuracy improvements alone do not reflect the substantial token overhead of explicit reasoning, we introduce a *Unified Gain* that jointly accounts for correctness and generation cost.

##### Motivation: why unified gain?

Standard comparisons such as $\mathrm{Acc}(\text{CoT})-\mathrm{Acc}(\text{Direct})$ ignore token usage, yet CoT often increases output length by orders of magnitude (Tables [1](https://arxiv.org/html/2605.22873#S4.T1), [2](https://arxiv.org/html/2605.22873#S4.T2)). For deployment, the relevant question is not only “which is more accurate?” but “which is more cost-effective under a chosen accuracy–cost trade-off?” Unified Gain provides a single scalar objective aligned with this requirement, enabling a fine-grained visualization over the manifold coordinates.

##### Relation to the main\-paper shorthand definition\.

In the main paper, we use a shorthand unified gain $U(m)=\mathrm{Acc}(m)-\lambda\cdot\mathrm{Cost}(m)$ for readability, where $\mathrm{Acc}(m)$ and $\mathrm{Cost}(m)$ can be understood as *already-aggregated* statistics (e.g., dataset-level averages). In this appendix, we restate the same objective at the *instance level* to remove ambiguity and to support heatmap construction: we first define a per-instance utility $U_{x}(m)=\mathbb{I}[\mathrm{correct}_{m}(x)]-\lambda\cdot\tilde{T}_{m}(x)$, and then aggregate by averaging over a set of instances. This formulation makes explicit (i) that accuracy corresponds to the mean of correctness indicators, and (ii) what token cost means and how it is normalized. When $\mathrm{Acc}(m)=\mathbb{E}_{x}[\mathbb{I}[\mathrm{correct}_{m}(x)]]$ and $\mathrm{Cost}(m)=\mathbb{E}_{x}[\tilde{T}_{m}(x)]$, the appendix definition reduces exactly to the main-paper shorthand. We normalize tokens as $\tilde{T}_{m}(x)=T_{m}(x)/1000$ so that $\lambda$ is comparable across datasets and models and remains numerically interpretable.
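The reduction to the main-paper shorthand follows from linearity of expectation. The short derivation below spells this out using only the symbols already defined in this section.

```latex
\begin{align*}
\mathbb{E}_{x}\big[U_{x}(m)\big]
  &= \mathbb{E}_{x}\big[\mathbb{I}[\mathrm{correct}_{m}(x)] - \lambda\,\tilde{T}_{m}(x)\big] \\
  &= \mathbb{E}_{x}\big[\mathbb{I}[\mathrm{correct}_{m}(x)]\big] - \lambda\,\mathbb{E}_{x}\big[\tilde{T}_{m}(x)\big] \\
  &= \mathrm{Acc}(m) - \lambda\cdot\mathrm{Cost}(m) = U(m).
\end{align*}
```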

##### Definition and computation\.

For an instance $x$ and decoding mode $m\in\{\text{Direct},\text{CoT}\}$, let $\mathbb{I}[\text{correct}_{m}(x)]\in\{0,1\}$ be the correctness indicator and $T_{m}(x)$ be the number of generated output tokens (in our experiments, $T_{m}(x)$ corresponds to the recorded output token length in the evaluation logs). We define the per-instance utility as

$$U_{x}(m)=\mathbb{I}[\text{correct}_{m}(x)]-\lambda\cdot\frac{T_{m}(x)}{1000},\qquad\Delta U_{x}(\text{CoT},\text{Direct})=U_{x}(\text{CoT})-U_{x}(\text{Direct}), \tag{10}$$

where $\lambda\geq 0$ is a cost-penalty coefficient and the token cost is normalized by $1000$ to keep $\lambda$ in a human-interpretable range. (The main-paper figure uses $\lambda=0.05$ for base models as a representative operating point. Larger $\lambda$ increasingly favors shorter outputs; smaller $\lambda$ approaches an accuracy-only comparison.) A positive $\Delta U_{x}$ means CoT is more cost-effective than Direct on $x$ under the chosen $\lambda$, and a negative value means the opposite.
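A minimal sketch of Eq. (10) is given below. The function names and the token counts in the usage lines are hypothetical; only the formula and the default $\lambda=0.05$ come from the text.

```python
def unified_utility(correct: bool, tokens: int, lam: float) -> float:
    """U_x(m) = 1[correct] - lam * tokens / 1000."""
    return float(correct) - lam * tokens / 1000.0


def unified_gain(cot_correct: bool, cot_tokens: int,
                 direct_correct: bool, direct_tokens: int,
                 lam: float = 0.05) -> float:
    """Delta U_x(CoT, Direct); positive means CoT is more cost-effective."""
    return (unified_utility(cot_correct, cot_tokens, lam)
            - unified_utility(direct_correct, direct_tokens, lam))


# Hypothetical instance where only CoT answers correctly (400 vs. 20 tokens).
gain_cot_needed = unified_gain(True, 400, False, 20)
# Hypothetical instance where both modes are correct: CoT only adds cost.
gain_cot_wasted = unified_gain(True, 400, True, 20)
```

With these inputs the first gain is positive (the correctness term dominates the 0.019 cost difference) and the second is negative, which is the sign pattern the heatmap cells aggregate.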

##### From per\-instance gain to a heatmap over the manifold\.

To construct Figure [2](https://arxiv.org/html/2605.22873#S2.F2), we compute each instance’s manifold coordinates using Standard probing:

$$\left(\frac{V_{\text{sp}}(x)}{a_{\text{vnr}}(x)},\,S_{H}(x)\right),$$

and then discretize the 2D plane into bins. Each cell reports the average unified gain among instances that fall into the corresponding bin:

$$\Delta U_{\text{cell}}=\mathbb{E}\left[\Delta U_{x}(\text{CoT},\text{Direct})\mid x\in\text{cell}\right].$$

This visualization directly links the observed efficiency frontier to entropy dynamics, without requiring any learned router.
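The binning step can be sketched as follows. The source does not specify the bin edges or the number of bins, so the axis ranges, the bin count, and the choice to clip out-of-range points into the border cells are all assumptions made for illustration, as are the function names.

```python
from collections import defaultdict


def bin_index(value, low, high, n_bins):
    """Map value to a bin in [0, n_bins - 1], clipping out-of-range values."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    return min(int((value - low) / (high - low) * n_bins), n_bins - 1)


def gain_heatmap(points, x_range, y_range, n_bins=4):
    """Average unified gain per manifold cell.

    points: iterable of (v_sp_over_a_vnr, s_h, delta_u) triples.
    Returns {(x_bin, y_bin): mean delta_u} for non-empty cells only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, delta_u in points:
        cell = (bin_index(x, x_range[0], x_range[1], n_bins),
                bin_index(y, y_range[0], y_range[1], n_bins))
        sums[cell] += delta_u
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical instances: (V_sp / a_vnr, S_H, Delta U).
example_points = [(-0.9, 5.0, 0.4), (-0.8, 6.0, 0.2), (0.5, 30.0, -0.3)]
cells = gain_heatmap(example_points, x_range=(-1.0, 1.0), y_range=(0.0, 40.0))
```

Empty cells are simply absent from the returned dictionary, which avoids reporting an average over zero instances.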

##### How to read the heatmap \(interpretation of the main\-paper figure\)\.

The heatmap in Figure [2](https://arxiv.org/html/2605.22873#S2.F2) shows a structured separation aligned with our routing logic:

- Convergent region (large negative $V_{\text{sp}}/a_{\text{vnr}}$): cells tend to have higher (often positive) $\Delta U$, indicating that when entropy decreases reliably relative to volatility, CoT is more likely to justify its extra tokens.
- Divergent/unstable region (positive $V_{\text{sp}}/a_{\text{vnr}}$): cells are dominated by negative $\Delta U$, consistent with the observation that CoT often overthinks and wastes tokens when uncertainty drifts upward.
- Uncertainty-load effect ($S_{H}$): even when the trend is not strongly positive, very large $S_{H}$ corresponds to early-stage uncertainty overload, where the additional CoT budget is less likely to convert into correctness, pushing $\Delta U$ downward.

Overall, the heatmap provides an empirical justification for using the two-axis manifold $(V_{\text{sp}}/a_{\text{vnr}},S_{H})$ as the decision space: it exposes where reasoning is truly cost-effective and where it is not.

#### A.4.1 Cross-dataset Calibration of $S_{H,\text{th}}$

##### Why is calibration needed?

The uncertainty-overload threshold $S_{H,\text{th}}$ is *model-dependent*. While fixed empirical thresholds (e.g., $S_{H,\text{th}}=32$ for base models and $10$ for think-enabled models in our experiments) can work well for the models studied in this paper, they are obtained by analyzing a closed set of model–dataset results and therefore may not transfer to a new model with different calibration, decoding behavior, or built-in “thinking” biases. Repeating a full benchmark sweep to re-tune $S_{H,\text{th}}$ for every new model is costly and often impractical in cold-start deployment. To address this, we introduce a lightweight *calibration* procedure used by EDRM-Global/Inst-C: for any unseen model (and optionally a new collection of datasets), it estimates a suitable $S_{H,\text{th}}$ from a small number of sampled instances via a simple heuristic, enabling fast and flexible threshold selection without additional training or exhaustive evaluation.

This part details the cross-dataset heuristic calibration used by EDRM-Global/Inst-C. Consider a fixed model evaluated on $J$ datasets $\{\mathcal{D}_{j}\}_{j=1}^{J}$. For each dataset, we sample $n$ instances, run $N$-step probing, and compute per-instance features. We then form dataset-level means

$$\mu_{S}^{(j)}=\frac{1}{n}\sum_{x\in\mathcal{D}^{(j)}_{\text{sample}}}S_{H}(x),\qquad\mu_{V}^{(j)}=\frac{1}{n}\sum_{x\in\mathcal{D}^{(j)}_{\text{sample}}}V_{\text{sp}}(x). \tag{11}$$

Let

$$M=\left|\left\{j\in\{1,\dots,J\}:\mu_{V}^{(j)}<0\right\}\right| \tag{12}$$

be the number of datasets whose average probing dynamics are convergent ($\mu_{V}^{(j)}<0$). Sort $\{\mu_{S}^{(j)}\}_{j=1}^{J}$ in ascending order to obtain $s_{(1)}\leq\dots\leq s_{(J)}$, and define

$$S_{H,\text{th}}=\left\lfloor s_{(M)}\right\rfloor, \tag{13}$$

where we clamp the index as $M\leftarrow\max(1,\min(M,J))$ to handle boundary cases. The full calibration procedure is presented in Algorithm [3](https://arxiv.org/html/2605.22873#alg3).

Algorithm 3: Heuristic Calibration of $S_{H,\text{th}}$ Across Datasets

1: Input: datasets $\{\mathcal{D}_j\}_{j=1}^{J}$, per-dataset sample size $n$, probe length $N$

2: Output: calibrated threshold $S_{H,\text{th}}$

3: for $j=1$ to $J$ do

4: Sample $\mathcal{D}^{(j)}_{\text{sample}}\subset\mathcal{D}_j$ with $|\mathcal{D}^{(j)}_{\text{sample}}|=n$

5: for each $x\in\mathcal{D}^{(j)}_{\text{sample}}$ do

6: Run $N$-step probing on $x$ to obtain entropy sequence $E_x$

7: Compute $S_H(x)$ and $V_{\text{sp}}(x)$

8: end for

9: $\mu_S^{(j)}\leftarrow\frac{1}{n}\sum_{x\in\mathcal{D}^{(j)}_{\text{sample}}}S_H(x)$

10: $\mu_V^{(j)}\leftarrow\frac{1}{n}\sum_{x\in\mathcal{D}^{(j)}_{\text{sample}}}V_{\text{sp}}(x)$

11: end for

12: $M\leftarrow\left|\left\{j:\mu_V^{(j)}<0\right\}\right|$

13: Sort $\{\mu_S^{(j)}\}_{j=1}^{J}$ ascending to obtain $s_{(1)}\leq\cdots\leq s_{(J)}$

14: $M\leftarrow\max(1,\min(M,J))$

15: $S_{H,\text{th}}\leftarrow\lfloor s_{(M)}\rfloor$

16: return $S_{H,\text{th}}$
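As a rough illustration of the calibration heuristic in Eqs. (11)–(13), the sketch below assumes that the per-instance features $(S_H(x), V_{\text{sp}}(x))$ have already been computed by probing. The function name `calibrate_threshold` and the input layout are hypothetical choices for this example, not part of the released pipeline.

```python
import math
from statistics import mean


def calibrate_threshold(per_dataset_features):
    """Heuristic calibration of S_{H,th} following Eqs. (11)-(13).

    per_dataset_features: one entry per dataset; each entry is a list of
    (S_H, V_sp) tuples for the n sampled instances (hypothetical layout).
    """
    if not per_dataset_features:
        raise ValueError("at least one dataset is required")

    # Eq. (11): dataset-level means of S_H and V_sp.
    mu_s = [mean(s for s, _ in feats) for feats in per_dataset_features]
    mu_v = [mean(v for _, v in feats) for feats in per_dataset_features]

    # Eq. (12): number of datasets with convergent average dynamics.
    num_datasets = len(mu_s)
    m = sum(1 for v in mu_v if v < 0)

    # Clamp the index to [1, J] to handle boundary cases.
    m = max(1, min(m, num_datasets))

    # Eq. (13): floor of the M-th smallest mean S_H (1-indexed).
    return math.floor(sorted(mu_s)[m - 1])


# Toy example with three datasets, two of which are convergent on average.
features = [
    [(12.0, -0.4), (13.0, -0.2)],  # mu_S = 12.5,  mu_V < 0
    [(36.0, 0.3), (35.0, 0.1)],    # mu_S = 35.5,  mu_V > 0
    [(20.5, -0.1), (20.0, -0.1)],  # mu_S = 20.25, mu_V < 0
]
threshold = calibrate_threshold(features)  # M = 2, s_(2) = 20.25
```

In this toy case the threshold is the floor of the second-smallest mean $S_H$, since two of the three datasets have negative mean $V_{\text{sp}}$.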

### A.5 Routing Decision Logic Analysis

This section analyzes the design rationale of EDRM’s three-branch routing (Direct, Standard, CoT), the decision boundaries in Algorithm [1](https://arxiv.org/html/2605.22873#alg1), and how these boundaries relate to the entropy-dynamics descriptors $(S_H, V_{\text{sp}}, a_{\text{vnr}})$.

##### Why three branches \(Direct / Standard / CoT\)?

We adopt three regimes because they span the practical efficiency–reliability spectrum while remaining deployable across both base and think\-enabled models:

- •Direct (minimum compute, maximum robustness against overthinking). Direct is the lowest-cost mode and is empirically strong on retrieval-heavy or low-reasoning tasks, where step-by-step deliberation often yields marginal or negative gains while substantially increasing token cost. Direct is also a conservative choice under decoding drift risk.
- •Standard (neutral default and probing anchor). Standard uses a minimally biased prompt (no explicit CoT instruction), providing a stable “native” decoding state for probing and serving as the middle option when the evidence for either extreme is insufficient.
- •CoT (maximum compute when progressive convergence is detected). CoT is reserved for cases where early decoding shows stable uncertainty reduction (a reliable convergence signature). In such cases, additional structured reasoning is most likely to translate into accuracy gains.

This tri\-partition is also operationally important: it avoids forcing a binary decision \(CoT vs\. Direct\) when the model is neither clearly convergent nor clearly divergent, reducing misrouting on boundary cases\.

##### Decision boundaries and their meaning in the reasoning manifold\.

EDRM uses two complementary signals: a *direction–stability* coupling via $(V_{\text{sp}}, a_{\text{vnr}})$ and an *uncertainty-load* guardrail via $S_H$. Concretely, Algorithm [1](https://arxiv.org/html/2605.22873#alg1) implements:

- •Divergence / drift region $\Rightarrow$ Direct. If $V_{\text{sp}} > k\cdot a_{\text{vnr}}$, the entropy trajectory has an overall *increasing* tendency, and the confidence does not improve with steps. This is a high drift-risk regime: extra reasoning tokens are likely to amplify exploration or produce verbose but ungrounded continuations. Therefore, we route to Direct.
- •Convergence region $\Rightarrow$ CoT. If $V_{\text{sp}} < -k\cdot a_{\text{vnr}}$, the trajectory exhibits a *stable monotone decrease* relative to its volatility. This indicates progressive uncertainty reduction, where structured reasoning is most likely to be beneficial; we route to CoT.
- •Uncertainty overload guardrail $\Rightarrow$ Direct. Even when $V_{\text{sp}}$ is not strongly positive, if $V_{\text{sp}} > 0$ and $S_H > S_{H,\text{th}}$, the model accumulates large total uncertainty during early decoding. Empirically, this region often corresponds to poor calibration or inability to settle on a coherent reasoning path; CoT tends to be long and unreliable. We thus route to Direct to cap cost and reduce drift.
- •Otherwise $\Rightarrow$ Standard. When neither strong convergence nor strong divergence is detected, we select Standard as the default middle ground.
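The four rules above can be sketched as a small decision function. This is a minimal sketch, not the reference implementation of Algorithm 1: the function name `route`, the order in which the checks are applied, and the example values of `k` and `s_h_th` are assumptions made for illustration, and the handling of early-terminated probes is omitted.

```python
def route(s_h, v_sp, a_vnr, k, s_h_th):
    """Select a decoding regime from the descriptors (S_H, V_sp, a_vnr).

    k and s_h_th are the routing hyperparameters; the values used in the
    examples below are illustrative only.
    """
    # Divergence / drift region: entropy trend rises relative to volatility.
    if v_sp > k * a_vnr:
        return "Direct"
    # Convergence region: stable decrease relative to volatility.
    if v_sp < -k * a_vnr:
        return "CoT"
    # Uncertainty-overload guardrail: weakly rising trend with a large
    # accumulated entropy during early decoding.
    if v_sp > 0 and s_h > s_h_th:
        return "Direct"
    # Neither strong convergence nor strong divergence.
    return "Standard"


# Illustrative calls (k = 0.5 and s_h_th = 32 are example values).
examples = {
    "diverging": route(s_h=10.0, v_sp=0.8, a_vnr=1.0, k=0.5, s_h_th=32),
    "converging": route(s_h=10.0, v_sp=-0.8, a_vnr=1.0, k=0.5, s_h_th=32),
    "overloaded": route(s_h=40.0, v_sp=0.1, a_vnr=1.0, k=0.5, s_h_th=32),
    "ambiguous": route(s_h=10.0, v_sp=0.1, a_vnr=1.0, k=0.5, s_h_th=32),
}
```

Because the guardrail only fires when $V_{\text{sp}} > 0$, a weakly decreasing trajectory with a large $S_H$ still falls through to Standard in this sketch.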

##### How $k$ and $S_{H,\text{th}}$ interact with the descriptors.

Parameter $k$ controls how much negative/positive trend is required *relative to volatility* to trigger CoT/Direct. Intuitively, $a_{\text{vnr}}$ serves as a noise-aware normalization: under high volatility, $|V_{\text{sp}}|$ must be larger to be trusted as a genuine monotone trend. Threshold $S_{H,\text{th}}$ acts as a capacity-dependent ceiling on early uncertainty load (base vs. reasoning models differ), preventing expensive CoT in regimes where the model is already “lost” in early decoding.

##### Why use the dataset mean of $V_{\text{sp}}$ (i.e., $\bar{V}_{\text{sp}}=\mathbb{E}[V_{\text{sp}}(x)]$) instead of computing $V_{\text{sp}}$ on the averaged entropy curve?

For dataset-level routing, we aggregate descriptors as $\bar{S}_H=\frac{1}{|\mathcal{D}_s|}\sum_x S_H(x)$, $\bar{V}_{\text{sp}}=\frac{1}{|\mathcal{D}_s|}\sum_x V_{\text{sp}}(x)$, $\bar{a}_{\text{vnr}}=\frac{1}{|\mathcal{D}_s|}\sum_x a_{\text{vnr}}(x)$. A tempting alternative is to first average the token-level entropies across samples, $\tilde{H}_i=\frac{1}{|\mathcal{D}_s|}\sum_x H_i(x)$, and then compute $V_{\text{sp}}(\tilde{H}_{1:N})$. We intentionally avoid this for three reasons:

- •Non-commutativity (nonlinearity) of Spearman. Spearman correlation is computed on *ranks* and is not a linear operator, hence in general $\mathrm{Spearman}\big(i,\mathbb{E}[H_i(x)]\big)\neq\mathbb{E}\big[\mathrm{Spearman}(i,H_i(x))\big]$. Averaging before ranking can change the ordering structure and distort the monotonicity signal.
- •Heterogeneity cancellation. Many datasets contain mixed instance types (some clearly convergent, others divergent). Averaging token-level entropies across such a mixture often yields an artificially smooth curve that can appear weakly decreasing even when a large fraction of instances are increasing (or vice versa). In contrast, averaging per-instance $V_{\text{sp}}(x)$ preserves the sign/magnitude distribution of instance-level trends, which is exactly what dataset-level routing aims to summarize.
- •Robustness to irregularities and early termination. In practice, some instances may terminate early ($T<N$) and are treated as confident (routed to Standard without descriptor-based decision). Constructing an averaged entropy curve requires additional alignment/padding choices that introduce bias. Computing $V_{\text{sp}}(x)$ per valid instance (with $T=N$) and then averaging avoids such artifacts and matches the actual routing semantics.

Empirically, we find that $\bar{V}_{\text{sp}}$ computed as the mean of per-instance trends aligns better with dataset-level CoT gains and yields more stable routing under sampling (cf. the low variances reported in Table [2](https://arxiv.org/html/2605.22873#S4.T2)).
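A toy example may make the difference between the two aggregation orders concrete. The entropy values below are invented for illustration, and the helper `spearman_vs_index` uses the textbook no-ties shortcut formula, so it assumes all entropies in a sequence are distinct.

```python
def spearman_vs_index(h):
    """Spearman correlation between the step index 1..N and entropies h.

    Uses rho = 1 - 6 * sum(d^2) / (N (N^2 - 1)), which is exact only when
    the values in h are all distinct (no ties).
    """
    n = len(h)
    order = sorted(range(n), key=lambda i: h[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    d2 = sum((i + 1 - rank[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Two hypothetical instances: one converges with large entropies,
# the other diverges with small entropies.
h_a = [10.0, 8.0, 6.0, 4.0, 2.0]
h_b = [1.0, 1.1, 1.2, 1.3, 1.4]

# Option 1: average the curves first, then compute the trend.
avg_curve = [(a + b) / 2 for a, b in zip(h_a, h_b)]
trend_of_mean = spearman_vs_index(avg_curve)

# Option 2 (used by EDRM): compute per-instance trends, then average.
mean_of_trends = (spearman_vs_index(h_a) + spearman_vs_index(h_b)) / 2
```

Here the averaged curve is dominated by the high-entropy instance and looks perfectly convergent, whereas the mean of per-instance trends reports an even split between convergent and divergent behavior.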

### A.6 MLP Variants Design and Training Details

This subsection details the data construction, variant design, and training setup for EDRM-MLP. The router is trained as an instance-level classifier over three regimes $\mathcal{M}\in\{\textit{Direct},\textit{Standard},\textit{CoT}\}$ using entropy-dynamics signals extracted from an $N=64$-step Standard probing decode.

##### Training targets \(multi\-label vs\. priority single\-label\)\.

For each instance, we evaluate all three decoding regimes and derive correctness indicators $\mathbf{y}=[y_D,y_S,y_C]\in\{0,1\}^3$, where $y_m=1$ iff mode $m$ answers correctly. This yields the original multi-label target (possibly multi-hot, e.g., $[1,1,0]$; or $[0,0,0]$ when all fail). To incorporate a token-efficiency preference, we also construct a priority-constrained single-label target $\mathbf{y}'$ by mapping multi-hot labels to a one-hot vector using the rule Direct $>$ Standard $>$ CoT; samples with $[0,0,0]$ are filtered out in this setting.
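The priority rule can be sketched in a few lines. The function name `priority_single_label` is a hypothetical choice for this example; it returns `None` for the all-fail pattern so that the caller can filter such samples out.

```python
MODES = ("Direct", "Standard", "CoT")  # priority order: cheapest first


def priority_single_label(y):
    """Map a multi-hot target [y_D, y_S, y_C] to a one-hot target.

    The cheapest correct mode wins (Direct > Standard > CoT).
    Returns None for [0, 0, 0], which is filtered out in this setting.
    """
    for idx, correct in enumerate(y):
        if correct == 1:
            one_hot = [0, 0, 0]
            one_hot[idx] = 1
            return one_hot
    return None


multi_hot_targets = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
single_label_targets = [
    t for t in map(priority_single_label, multi_hot_targets) if t is not None
]
```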

##### Input variants \(3D / 64D / 67D\)\.

All variants share the same probing source: we take the token-level entropy sequence $E=\{H_1,\dots,H_T\}$ recorded in Standard decoding logs and align it to a fixed length $N=64$ by truncation or zero-padding: if $T\geq N$, keep the first $N$ entropies; otherwise pad $E$ with zeros to length $N$. We then form three input representations: (i) 3D descriptors: $(S_H, V_{\text{sp}}, a_{\text{vnr}})$, where $S_H=\sum_{i=1}^{N}H_i$, $V_{\text{sp}}$ is the Spearman trend, and $a_{\text{vnr}}$ is the Von Neumann ratio; (ii) 64D trajectory: the aligned entropy vector $(H_1,\dots,H_N)$; (iii) 67D hybrid: concatenation of the 64D trajectory and the 3D descriptors. Combining the two label strategies with the three input forms yields the six ablation variants used in our experiments.
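The alignment and the three input forms can be sketched as follows. Two points are assumptions of this sketch rather than statements about the actual pipeline: the Von Neumann ratio is taken in its common unnormalized form (sum of squared successive differences divided by the sum of squared deviations from the mean), and ties in the Spearman computation are handled with average ranks. All function names are hypothetical.

```python
def align(entropies, n=64):
    """Truncate to the first n entropies or zero-pad to length n."""
    return list(entropies[:n]) + [0.0] * max(0, n - len(entropies))


def _average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_trend(h):
    """Spearman correlation between step index and entropy (V_sp)."""
    n = len(h)
    rx = [float(i + 1) for i in range(n)]
    ry = _average_ranks(h)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def von_neumann_ratio(h):
    """Assumed form: sum of squared successive differences / sum of squared deviations."""
    mean_h = sum(h) / len(h)
    denom = sum((x - mean_h) ** 2 for x in h)
    num = sum((h[i + 1] - h[i]) ** 2 for i in range(len(h) - 1))
    return num / denom if denom > 0 else 0.0


def build_inputs(entropies, n=64):
    """Return the 3D, 64D, and 67D input representations."""
    h = align(entropies, n)
    descriptors = [sum(h), spearman_trend(h), von_neumann_ratio(h)]
    return {"3D": descriptors, "64D": h, "67D": h + descriptors}


inputs = build_inputs([2.0, 1.5, 1.0])  # short sequence, zero-padded to 64
```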

##### Data construction and split\.

We build per-instance features and labels by merging: (a) the entropy trajectory from Standard probing logs, and (b) per-instance descriptors and correctness labels computed by the EDRM evaluation pipeline. To keep training lightweight while preserving coverage, we perform 10% stratified sampling for training: for each dataset, we group instances by the 8 possible label patterns (e.g., $000, 001, \dots, 111$) and sample 10% from each group (at least one sample when available). The remaining instances are used for testing.
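A possible sketch of the per-dataset stratified split is shown below. The rounding rule (floor of 10% with a minimum of one sample per non-empty group), the seeding, and the function name `stratified_split` are assumptions of this example, since the text does not specify them.

```python
import random
from collections import defaultdict


def stratified_split(labels, frac=0.10, seed=0):
    """Split one dataset into train/test indices by label pattern.

    labels: list of [y_D, y_S, y_C] correctness vectors, one per instance.
    Returns (train_indices, test_indices).
    """
    # Group instance indices by their label pattern, e.g. "110".
    groups = defaultdict(list)
    for idx, y in enumerate(labels):
        groups["".join(str(int(b)) for b in y)].append(idx)

    rng = random.Random(seed)
    train = []
    for pattern in sorted(groups):
        members = groups[pattern]
        # Assumed rounding: floor(frac * size), but at least one sample.
        k = max(1, int(len(members) * frac))
        train.extend(rng.sample(members, k))

    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


toy_labels = [[1, 1, 0]] * 20 + [[0, 0, 0]] * 5
train_idx, test_idx = stratified_split(toy_labels)
```

With 20 instances of pattern 110 and 5 of pattern 000, this selects two and one training samples respectively, leaving 22 instances for testing.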

##### Model and optimization\.

We use a compact 3-layer MLP that outputs 3 logits. Inputs are standardized with a StandardScaler fitted on the training split. For multi-label training, we use BCEWithLogitsLoss with per-class pos_weight (computed as neg/pos on the training set) to mitigate label imbalance. For single-label training, we use CrossEntropyLoss with inverse-frequency class weights. Unless otherwise specified, we train with Adam, learning rate $10^{-3}$, batch size 32, hidden dimension 128, weight decay $10^{-4}$, and 100 epochs (we use 120 epochs for the 64D multi-label variant in our default scripts).
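The two weighting schemes can be illustrated without any deep-learning dependency. In the sketch below, `pos_weights` follows the neg/pos rule stated above, while the normalization used in `inverse_frequency_weights` (total count divided by the number of classes times the class count) is an assumption, as is the fallback value used when a class has no positive examples. Both function names are hypothetical.

```python
def pos_weights(multi_hot_labels, num_classes=3):
    """Per-class neg/pos ratio, as used for a BCE-style pos_weight."""
    n = len(multi_hot_labels)
    weights = []
    for c in range(num_classes):
        pos = sum(y[c] for y in multi_hot_labels)
        neg = n - pos
        # Fallback of 1.0 for classes without positives is an assumption.
        weights.append(neg / pos if pos > 0 else 1.0)
    return weights


def inverse_frequency_weights(class_ids, num_classes=3):
    """Class weights proportional to 1 / frequency (assumed normalization)."""
    counts = [0] * num_classes
    for c in class_ids:
        counts[c] += 1
    n = len(class_ids)
    return [n / (num_classes * cnt) if cnt > 0 else 0.0 for cnt in counts]


multi_label_weights = pos_weights([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
single_label_weights = inverse_frequency_weights([0, 0, 0, 1])
```

In a PyTorch setup, such lists would typically be converted to tensors and passed as the `pos_weight` argument of `BCEWithLogitsLoss` and the `weight` argument of `CrossEntropyLoss`, respectively.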

## Appendix B: Experimental Details

### B.1 Datasets and Models

We evaluate our method on 15 diverse benchmarks spanning four key reasoning categories: mathematical reasoning, commonsense and knowledge reasoning, scientific reasoning, and formal / logical reasoning\. All datasets follow standard evaluation protocols in prior reasoning research\.

##### Mathematical reasoning\.

- •GSM8K: Elementary school math problems requiring multi\-step numerical reasoning\.
- •MultiArith: Arithmetic word problems testing basic numerical reasoning\.
- •BBH: A challenging subset of Big\-Bench Hard focusing on complex symbolic and logical reasoning\.

##### Commonsense & knowledge reasoning\.

- •CommonsenseQA \(CSQA\): Commonsense understanding via multiple\-choice questions\.
- •StrategyQA: Implicit multi\-hop reasoning to answer binary questions\.
- •PIQA: Physical commonsense about plausible everyday actions\.
- •SIQA: Social commonsense and causal reasoning in interactive scenarios\.
- •MuSR: Murder mystery\-style narrative causal inference\.

##### Scientific reasoning\.

- •ARC\-Challenge \(ARC\-C\): Hard multiple\-choice science test questions\.
- •ARC\-Easy \(ARC\-E\): Simpler science test questions\.
- •GPQA: Graduate\-level scientific reasoning questions focused on biology and STEM\.

##### Formal & deductive reasoning\.

- •FOLIO: First\-order logic reasoning problems\.
- •LSAT: Logical reasoning tasks from the Law School Admission Test\.
- •ContextHub\_abductive \(CH\-Abd\): Abductive reasoning from observations to explanations\.
- •ContextHub\_deductive \(CH\-Ded\): Strict deductive reasoning from premises\.

#### B.1.1 Model Details

We evaluate four instruction-tuned LLMs spanning both *base* and *reasoning-enhanced* families to test EDRM’s cross-model robustness. All models are used in their public Instruct checkpoints and are decoded with greedy generation for all regimes to ensure comparability.

##### Evaluated models\.

- •Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.22873#bib.bib12)): an 8B parameter base instruct model.
- •Llama-3.2-3B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.22873#bib.bib12)): a lightweight 3B parameter base instruct model.
- •Qwen2.5-7B-Instruct (Hui et al., [2024](https://arxiv.org/html/2605.22873#bib.bib14)): a 7B parameter base instruct model from the Qwen family.
- •Qwen3-4B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2605.22873#bib.bib13)): a 4B parameter *reasoning-enhanced* model with built-in think behavior, often producing verbose deliberation by default; we include it as an “over-reasoning” stress test.

##### Decoding regimes and model\-specific control\.

We apply the same three regimes (Direct, Standard, CoT) across all models (Section 3). For reasoning-enhanced models (Qwen3-4B-Instruct-2507), we additionally control the internal thinking behavior:

- •Direct: force final-answer-only output (and disable/avoid explicit think traces when supported by the model interface).
- •Standard: use a neutral instruction without step-by-step cues; for Qwen3 we also disable/avoid think so probing reflects minimally perturbed decoding dynamics.
- •CoT: explicitly request step-by-step reasoning; for Qwen3 we enable/allow think to realize the intended reasoning mode.

This design ensures that “CoT” corresponds to increased deliberation budget, while “Direct/Standard” reflect low\-deliberation decoding, which is crucial for a fair assessment of routing under overthinking\-prone models\.

##### Generation and evaluation settings\.

All experiments use greedy decoding (temperature $=0$) with identical stopping criteria across regimes. The maximum generation length is set to 4096 tokens. For instance-level routing, the probing phase uses $N=64$ steps under the Standard regime before the router selects the final decoding mode.

### B.2 Prompt Design and Decoding Details

This subsection details the prompt templates and decoding configurations for the three regimes (Direct, Standard, CoT) across different model families.

##### Prompt templates by decoding regime\.

We design task-specific prompts for two categories: Answer tasks (mathematical reasoning: gsm8k, MultiArith, MATH) and Choice tasks (multi-choice QA: all other benchmarks). Table [5](https://arxiv.org/html/2605.22873#A2.T5) summarizes the prompts for each decoding regime.

Table 5: Prompt templates for each decoding regime across task types.

| Regime | Task type | Prompt suffix |
| --- | --- | --- |
| Direct | Answer | “Your answer must not include any reasoning step. You must only write your answer directly. You only output ‘The answer is \<answer\>’.” |
| Direct | Choice | “Your answer must not include any reasoning. Write the answer: ‘Answer: \<Your Answer Letter Choice\>’” |
| Standard | Both | “{question}” (no additional instruction) |
| CoT | Both | “Let’s think step by step.” |
##### Model\-specific chat templates\.

Table [6](https://arxiv.org/html/2605.22873#A2.T6) shows the official chat templates used for each model family.

Table 6: Chat templates for different model families.

| Model Family | Chat Template |
| --- | --- |
| Llama-3.x Instruct | <\|begin\_of\_text\|>...user...<\|eot\_id\|>...assistant... |
| Qwen2.5 / Qwen3 | <\|im\_start\|>user{prompt}<\|im\_end\|>...<\|im\_start\|>assistant |
##### Handling reasoning\-enhanced models\.

For models with built-in thinking capabilities (e.g., Qwen3-4B-Instruct-2507), we apply additional control via the enable\_thinking parameter and system prompts, as summarized in Table [7](https://arxiv.org/html/2605.22873#A2.T7).

Table 7: Thinking mode control for reasoning-enhanced models.

| Regime | enable\_thinking | Additional Control |
| --- | --- | --- |
| Direct/Standard | False | Qwen3.5: prepend “/no\_think” system prompt |
| CoT | True | None (allow thinking traces) |

This design ensures that the three decoding regimes correspond to distinct levels of deliberation budget: Direct forces minimal output, Standard provides neutral prompting, and CoT explicitly solicits extended reasoning.
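The regime settings in Tables 5 and 7 can be collected into a small lookup, sketched below. The function name `build_request`, the dictionary layout, and the choice to join the question and suffix with a newline are assumptions of this example. The system-prompt control from Table 7 and the model-specific chat template are deliberately left out.

```python
# Prompt suffixes taken from Table 5, keyed by (regime, task type).
PROMPT_SUFFIX = {
    ("Direct", "answer"): (
        "Your answer must not include any reasoning step. "
        "You must only write your answer directly. "
        "You only output 'The answer is <answer>'."
    ),
    ("Direct", "choice"): (
        "Your answer must not include any reasoning. "
        "Write the answer: 'Answer: <Your Answer Letter Choice>'"
    ),
    ("CoT", "answer"): "Let's think step by step.",
    ("CoT", "choice"): "Let's think step by step.",
    # Standard has no suffix: the question is used as-is.
}


def build_request(question, regime, task_type, thinking_model=False):
    """Assemble the prompt and thinking flag for one decoding regime."""
    suffix = PROMPT_SUFFIX.get((regime, task_type), "")
    # Joining with a newline is an assumption of this sketch.
    prompt = f"{question}\n{suffix}" if suffix else question
    request = {"prompt": prompt}
    if thinking_model:
        # Table 7: thinking is enabled only for the CoT regime.
        request["enable_thinking"] = regime == "CoT"
    return request


standard_request = build_request("What is 2 + 2?", "Standard", "answer")
cot_request = build_request("What is 2 + 2?", "CoT", "answer", thinking_model=True)
```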

### B.3 Evaluation Metrics Details

This subsection details the computation of accuracy, token consumption, and consistency metrics used throughout our experiments\.

##### Accuracy and token consumption\.

For a decoding mode $m\in\{\textit{Direct},\textit{Standard},\textit{CoT}\}$, we compute dataset-level accuracy and average token consumption as:

$$\text{Acc}(m)=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\mathbb{I}[\text{correct}_m(x)],\qquad \text{AvgTok}(m)=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}T_m(x), \tag{14}$$

where $\mathbb{I}[\text{correct}_m(x)]\in\{0,1\}$ indicates whether mode $m$ produces a correct answer for instance $x$, and $T_m(x)$ denotes the number of generated tokens.

##### Dataset\-level routing \(EDRM\-Global\)\.

When routing selects mode $m^*(x)$ for each instance, the effective accuracy and token cost are:

$$\text{Acc}_{\text{Global}}=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\mathbb{I}[\text{correct}_{m^*(x)}(x)],\qquad \text{AvgTok}_{\text{Global}}=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}T_{m^*(x)}(x). \tag{15}$$

Since EDRM-Global probes only a small calibration subset ($n=50$ samples, $<5$ tokens per instance on average), we do not include probing overhead in the reported token consumption.

##### Instance\-level routing \(EDRM\-Inst\)\.

Instance-level routing requires probing every instance, so we strictly account for all token costs. For an instance $x$ routed to mode $m^*(x)$, the total token cost is:

$$T_{\text{total}}(x)=T_{m^*(x)}(x)+T_{\text{probe}}(x)+T_{\text{fallback}}(x), \tag{16}$$

where:

- • $T_{\text{probe}}(x)=N$ if $m^*(x)\neq\textit{Standard}$ (probing tokens are consumed when Standard is not selected), otherwise $T_{\text{probe}}(x)=0$;
- • $T_{\text{fallback}}(x)=T_{\textit{Direct}}(x)$ if $m^*(x)\neq\textit{Direct}$ (fallback compensation adds the Direct branch), otherwise $T_{\text{fallback}}(x)=0$.

The effective accuracy incorporates fallback compensation: if $m^*(x)\neq\textit{Direct}$, the instance is considered correct when either $m^*(x)$ or Direct produces a correct answer. Formally:

$$\mathbb{I}[\text{correct}_{\text{Inst}}(x)]=\begin{cases}\mathbb{I}[\text{correct}_{m^*(x)}(x)]\lor\mathbb{I}[\text{correct}_{\textit{Direct}}(x)], & m^*(x)\neq\textit{Direct}\\ \mathbb{I}[\text{correct}_{\textit{Direct}}(x)], & m^*(x)=\textit{Direct}\end{cases} \tag{17}$$

The dataset-level metrics are then:

$$\text{Acc}_{\text{Inst}}=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\mathbb{I}[\text{correct}_{\text{Inst}}(x)],\qquad \text{AvgTok}_{\text{Inst}}=\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}T_{\text{total}}(x). \tag{18}$$
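The accounting in Eqs. (16)–(18) can be sketched as a short function over per-instance records. The record layout (keys `route`, `correct`, `tokens`) and the function name `instance_metrics` are hypothetical, and the numbers in the example are invented.

```python
def instance_metrics(records, n_probe=64):
    """Compute (Acc_Inst, AvgTok_Inst) following Eqs. (16)-(18).

    Each record is a dict with:
      'route':   selected mode m*(x), one of "Direct", "Standard", "CoT"
      'correct': mapping mode -> bool
      'tokens':  mapping mode -> number of generated tokens
    """
    total_correct = 0
    total_tokens = 0
    for r in records:
        mode = r["route"]
        tokens = r["tokens"][mode]
        # T_probe: probing tokens are charged unless Standard is selected.
        if mode != "Standard":
            tokens += n_probe
        if mode != "Direct":
            # T_fallback: the Direct branch is added as fallback.
            tokens += r["tokens"]["Direct"]
            correct = r["correct"][mode] or r["correct"]["Direct"]
        else:
            correct = r["correct"]["Direct"]
        total_correct += int(correct)
        total_tokens += tokens
    n = len(records)
    return total_correct / n, total_tokens / n


toy_records = [
    {"route": "CoT",
     "correct": {"Direct": True, "Standard": False, "CoT": False},
     "tokens": {"Direct": 4, "Standard": 100, "CoT": 300}},
    {"route": "Standard",
     "correct": {"Direct": False, "Standard": False, "CoT": True},
     "tokens": {"Direct": 5, "Standard": 120, "CoT": 250}},
]
acc_inst, avg_tok_inst = instance_metrics(toy_records)
```

In the first toy record the CoT answer is wrong but the Direct fallback is right, so the instance counts as correct at a cost of 300 + 64 + 4 tokens. The second record is routed to Standard, so no probing tokens are charged.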

##### Routing consistency\.

For experiments requiring multiple trials (8 random seeds $\{0,1,2,3,11,12,13,14\}$), we report the routing decision distribution as the ratio $D:S:C$, where $D$, $S$, and $C$ denote the number of trials (out of 8) that select Direct, Standard, and CoT respectively. This ratio directly reflects the stability of dataset-level routing decisions: a concentrated distribution (e.g., $8:0:0$) indicates high consistency across random seeds, while a dispersed distribution (e.g., $2:3:3$) indicates sensitivity to sampling variation.
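Computing the ratio from a list of per-seed decisions is straightforward. The sketch below uses a hypothetical helper `routing_ratio` and invented decisions for illustration.

```python
from collections import Counter


def routing_ratio(decisions):
    """Format per-trial routing decisions as the D:S:C ratio."""
    counts = Counter(decisions)
    return f"{counts['Direct']}:{counts['Standard']}:{counts['CoT']}"


# Hypothetical decisions for one dataset over 8 seeds.
decisions = ["Direct"] * 6 + ["Standard"] * 2
ratio = routing_ratio(decisions)
```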

### B.4 Additional EDRM-Global Results

##### EDRM\-Global with Llama\-3\.1\-8B\-Instruct\.

The following table presents the comprehensive performance comparison of EDRM-Global (both E and C variants) against Direct, Standard, and CoT decoding strategies for the Llama-3.1-8B-Instruct model across all 15 benchmarks. The results include accuracy percentages, average token consumption, and the distribution of routing decisions over 8 trials (Direct: D, Standard: S, CoT: C) for each dataset. The analysis highlights that both EDRM-Global variants achieve an overall accuracy of 78.23% with an average token consumption of 335.1, representing a significant reduction in token cost by 47.8% compared to CoT (642.5 tokens) while improving accuracy by 0.50 percentage points. Notably, all per-dataset accuracy variances are below $1.6\times 10^{-3}$, indicating stable per-dataset performance across trials.

Table 8: EDRM-Global (E&C) performance comparison of Llama-3.1-8B-Instruct across 15 benchmarks. Both EDRM variants achieve 78.23% accuracy with 335.1 average tokens, reducing token cost by 47.8% compared to CoT (642.5 tokens) while improving accuracy by 0.50 percentage points. All per-dataset accuracy variances are below $1.6\times 10^{-3}$.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Global-E $\overline{\text{Acc}}$ (%) | EDRM-Global-E $\overline{\text{Tok}}$ | EDRM-Global-E D:S:C | EDRM-Global-C $\overline{\text{Acc}}$ (%) | EDRM-Global-C $\overline{\text{Tok}}$ | EDRM-Global-C D:S:C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-C | 90.78 | 6.3 | 94.37 | 254.8 | 94.45 | 416.2 | 90.78 | 6.3 | 8:0:0 | 90.78 | 6.3 | 8:0:0 |
| ARC-E | 96.09 | 5.9 | 97.98 | 190.1 | 98.32 | 344.9 | 96.09 | 5.9 | 8:0:0 | 96.09 | 5.9 | 8:0:0 |
| BBH | 57.23 | 8.2 | 82.93 | 556.4 | 84.09 | 664.6 | 63.66 | 145.3 | 6:2:0 | 63.66 | 145.3 | 6:2:0 |
| CSQA | 75.92 | 6.3 | 80.43 | 256.2 | 80.43 | 414.5 | 75.92 | 6.3 | 8:0:0 | 75.92 | 6.3 | 8:0:0 |
| CH-Abd | 29.96 | 8.1 | 65.00 | 869.9 | 63.83 | 943.2 | 63.83 | 943.2 | 0:0:8 | 63.83 | 943.2 | 0:0:8 |
| CH-Ded | 54.00 | 6.2 | 82.46 | 634.4 | 82.87 | 676.6 | 82.87 | 676.6 | 0:0:8 | 82.87 | 676.6 | 0:0:8 |
| FOLIO | 60.71 | 4.7 | 77.41 | 816.7 | 76.83 | 865.3 | 76.83 | 865.3 | 0:0:8 | 76.83 | 865.3 | 0:0:8 |
| GPQA | 38.17 | 6.5 | 48.88 | 2611.3 | 47.99 | 2762.7 | 47.54 | 2285.7 | 1:7:0 | 47.54 | 2285.7 | 1:7:0 |
| GSM8K | 27.22 | 7.5 | 90.22 | 321.8 | 90.83 | 339.1 | 90.68 | 334.8 | 0:2:6 | 90.68 | 334.8 | 0:2:6 |
| LSAT | 66.70 | 4.8 | 82.16 | 1420.3 | 83.05 | 1550.7 | 76.36 | 889.5 | 3:5:0 | 76.36 | 889.5 | 3:5:0 |
| MultiArith | 81.83 | 5.7 | 97.83 | 107.9 | 97.83 | 129.5 | 97.83 | 116.0 | 0:5:3 | 97.83 | 116.0 | 0:5:3 |
| MuSR | 58.33 | 5.7 | 57.80 | 1059.3 | 62.83 | 1401.7 | 58.33 | 5.7 | 8:0:0 | 58.33 | 5.7 | 8:0:0 |
| PIQA | 83.35 | 5.0 | 84.71 | 167.9 | 88.14 | 425.4 | 83.35 | 5.0 | 8:0:0 | 83.35 | 5.0 | 8:0:0 |
| SIQA | 71.90 | 5.6 | 74.87 | 239.8 | 77.02 | 398.6 | 71.90 | 5.6 | 8:0:0 | 71.90 | 5.6 | 8:0:0 |
| StratQA | 81.09 | 5.1 | 77.82 | 206.5 | 74.50 | 326.1 | 81.09 | 5.1 | 8:0:0 | 81.09 | 5.1 | 8:0:0 |
| Overall | 65.63 | 6.2 | 80.96 | 518.4 | 81.35 | 642.5 | 78.23 | 335.1 | – | 78.23 | 335.1 | – |

##### EDRM\-Global with Qwen2\.5\-7B\-Instruct\.

The following table presents the comprehensive performance comparison of EDRM-Global (both E and C variants) against Direct, Standard, and CoT decoding strategies for the Qwen2.5-7B-Instruct model across all 15 benchmarks. The results include accuracy percentages, average token consumption, and the distribution of routing decisions over 8 trials (Direct: D, Standard: S, CoT: C) for each dataset. The analysis highlights that both EDRM-Global variants achieve an overall accuracy of 74.01% with an average token consumption of 191.8, representing a significant reduction in token cost by 42.0% compared to CoT (330.3 tokens) while improving accuracy by 0.63 percentage points. Notably, all per-dataset accuracy variances are below $5.0\times 10^{-4}$, indicating stable per-dataset performance across trials.

Table 9: EDRM-Global (E&C) performance comparison of Qwen2.5-7B-Instruct across 15 benchmarks. Both EDRM variants achieve 74.01% accuracy with 191.8 average tokens, reducing token cost by 42.0% compared to CoT (330.3 tokens) while improving accuracy by 0.63 percentage points. All per-dataset accuracy variances are below $5.0\times 10^{-4}$.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Global-E $\overline{\text{Acc}}$ (%) | EDRM-Global-E $\overline{\text{Tok}}$ | EDRM-Global-E D:S:C | EDRM-Global-C $\overline{\text{Acc}}$ (%) | EDRM-Global-C $\overline{\text{Tok}}$ | EDRM-Global-C D:S:C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-C | 90.44 | 3.0 | 90.87 | 198.3 | 91.72 | 289.7 | 90.49 | 27.4 | 7:1:0 | 90.44 | 3.0 | 8:0:0 |
| ARC-E | 95.75 | 3.0 | 96.51 | 157.8 | 96.21 | 270.5 | 95.75 | 3.0 | 8:0:0 | 95.75 | 3.0 | 8:0:0 |
| BBH | 53.91 | 3.4 | 68.47 | 223.7 | 71.07 | 330.4 | 70.10 | 290.4 | 0:3:5 | 70.10 | 290.4 | 0:3:5 |
| CSQA | 81.08 | 3.0 | 81.00 | 120.6 | 80.84 | 252.3 | 81.08 | 3.0 | 8:0:0 | 81.08 | 3.0 | 8:0:0 |
| CH-Abd | 34.21 | 3.1 | 52.08 | 375.5 | 52.71 | 432.5 | 52.71 | 432.5 | 0:0:8 | 52.71 | 432.5 | 0:0:8 |
| CH-Ded | 46.92 | 3.0 | 53.21 | 337.2 | 52.88 | 365.8 | 52.88 | 365.8 | 0:0:8 | 52.88 | 365.8 | 0:0:8 |
| FOLIO | 53.82 | 3.0 | 67.61 | 356.5 | 69.44 | 398.4 | 69.44 | 398.4 | 0:0:8 | 69.44 | 398.4 | 0:0:8 |
| GPQA | 39.51 | 3.0 | 36.16 | 670.9 | 29.91 | 806.3 | 35.38 | 687.8 | 0:7:1 | 35.38 | 687.8 | 0:7:1 |
| GSM8K | 22.59 | 6.6 | 89.76 | 292.8 | 89.23 | 299.9 | 89.36 | 298.2 | 0:2:6 | 89.36 | 298.2 | 0:2:6 |
| LSAT | 61.35 | 3.1 | 56.39 | 492.1 | 57.68 | 545.9 | 57.68 | 545.9 | 0:0:8 | 57.68 | 545.9 | 0:0:8 |
| MultiArith | 61.17 | 5.8 | 98.83 | 174.7 | 98.83 | 182.3 | 98.83 | 182.3 | 0:0:8 | 98.83 | 182.3 | 0:0:8 |
| MuSR | 54.89 | 3.0 | 53.84 | 182.3 | 52.78 | 384.8 | 54.89 | 3.0 | 8:0:0 | 54.89 | 3.0 | 8:0:0 |
| PIQA | 86.62 | 3.4 | 82.43 | 119.4 | 87.32 | 235.5 | 86.62 | 3.4 | 8:0:0 | 86.62 | 3.4 | 8:0:0 |
| SIQA | 73.39 | 3.1 | 72.67 | 101.9 | 74.46 | 261.7 | 73.39 | 3.1 | 8:0:0 | 73.39 | 3.1 | 8:0:0 |
| StratQA | 85.81 | 3.0 | 80.83 | 151.4 | 77.95 | 259.7 | 85.81 | 3.0 | 8:0:0 | 85.81 | 3.0 | 8:0:0 |
| Overall | 64.63 | 3.4 | 72.89 | 240.6 | 73.38 | 330.3 | 74.01 | 191.8 | – | 74.01 | 190.6 | – |

##### EDRM\-Global with Qwen3\-4B\-Instruct\-2507\.

The following table presents the comprehensive performance comparison of EDRM-Global (both E and C variants) against Direct, Standard, and CoT decoding strategies for the Qwen3-4B-Instruct-2507 model across all 15 benchmarks. The results include accuracy percentages, average token consumption, and the distribution of routing decisions over 8 trials (Direct: D, Standard: S, CoT: C) for each dataset. The analysis highlights that both EDRM-Global variants achieve an overall accuracy of 68.48% with an average token consumption of 164.4, representing a significant reduction in token cost by 40.8% compared to CoT (277.8 tokens) while improving accuracy by 0.35 percentage points. Notably, all per-dataset accuracy variances are below $5.5\times 10^{-3}$, indicating stable per-dataset performance across trials.

Table 10: EDRM-Global (E&C) performance comparison of Qwen3-4B-Instruct-2507 across 15 benchmarks. Both EDRM variants achieve 68.48% accuracy with 164.4 average tokens, reducing token cost by 40.8% compared to CoT (277.8 tokens) while improving accuracy by 0.35 percentage points. All per-dataset accuracy variances are below $5.5\times 10^{-3}$.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Global-E $\overline{\text{Acc}}$ (%) | EDRM-Global-E $\overline{\text{Tok}}$ | EDRM-Global-E D:S:C | EDRM-Global-C $\overline{\text{Acc}}$ (%) | EDRM-Global-C $\overline{\text{Tok}}$ | EDRM-Global-C D:S:C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-C | 81.06 | 4.0 | 81.14 | 29.7 | 86.60 | 270.1 | 81.10 | 10.4 | 6:2:0 | 81.10 | 16.9 | 4:4:0 |
| ARC-E | 92.76 | 4.0 | 92.72 | 16.1 | 94.19 | 235.2 | 93.09 | 69.4 | 1:5:2 | 93.09 | 69.4 | 1:5:2 |
| BBH | 54.29 | 4.0 | 60.95 | 149.5 | 65.33 | 303.2 | 64.78 | 284.0 | 0:1:7 | 64.78 | 284.0 | 0:1:7 |
| CSQA | 74.04 | 4.0 | 71.99 | 47.2 | 74.61 | 230.1 | 74.04 | 4.0 | 8:0:0 | 74.04 | 4.0 | 8:0:0 |
| CH-Abd | 31.04 | 4.0 | 39.21 | 240.0 | 39.92 | 284.3 | 39.92 | 284.3 | 0:0:8 | 39.92 | 284.3 | 0:0:8 |
| CH-Ded | 36.33 | 4.0 | 54.13 | 230.3 | 56.04 | 240.1 | 54.85 | 234.0 | 0:5:3 | 54.85 | 234.0 | 0:5:3 |
| FOLIO | 48.42 | 4.0 | 52.24 | 245.5 | 56.98 | 337.7 | 56.98 | 337.7 | 0:0:8 | 56.98 | 337.7 | 0:0:8 |
| GPQA | 36.16 | 11.5 | 33.71 | 327.2 | 27.01 | 1005.8 | 31.20 | 581.7 | 0:5:3 | 31.20 | 581.7 | 0:5:3 |
| GSM8K | 14.86 | 16.8 | 84.46 | 227.2 | 82.64 | 234.8 | 84.46 | 227.2 | 0:8:0 | 84.46 | 227.2 | 0:8:0 |
| LSAT | 50.74 | 4.0 | 50.74 | 213.7 | 49.85 | 491.7 | 50.63 | 222.2 | 1:6:1 | 50.63 | 222.2 | 1:6:1 |
| MultiArith | 42.50 | 7.0 | 98.83 | 116.9 | 98.00 | 125.7 | 98.83 | 116.9 | 0:8:0 | 98.83 | 116.9 | 0:8:0 |
| MuSR | 51.85 | 4.0 | 50.26 | 165.6 | 51.32 | 291.7 | 51.85 | 4.0 | 8:0:0 | 51.06 | 84.8 | 4:4:0 |
| PIQA | 81.99 | 4.0 | 82.59 | 42.9 | 83.19 | 216.8 | 82.89 | 146.7 | 1:2:5 | 82.89 | 146.7 | 1:2:5 |
| SIQA | 70.32 | 4.0 | 70.01 | 48.3 | 70.68 | 214.4 | 70.32 | 4.0 | 8:0:0 | 70.32 | 4.0 | 8:0:0 |
| StratQA | 79.61 | 4.0 | 84.19 | 15.6 | 69.43 | 255.7 | 77.51 | 102.8 | 2:3:3 | 77.51 | 102.8 | 2:3:3 |
| Overall | 58.99 | 4.9 | 68.21 | 127.1 | 68.07 | 277.8 | 68.48 | 164.4 | – | 68.46 | 167.4 | – |

### B.5 Additional EDRM-Instance Results

This subsection presents instance-level routing results for the remaining three models: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen3-4B-Instruct-2507.

##### EDRM\-Instance with Llama\-3\.1\-8B\-Instruct\.

Table [11](https://arxiv.org/html/2605.22873#A2.T11) presents the instance-level routing performance on Llama-3.1-8B-Instruct. EDRM-MLP achieves 72.27% accuracy with 149.9 tokens, outperforming CoT (68.07% accuracy, 277.8 tokens) by +4.20 percentage points while reducing token consumption by 46.0%. Notably, EDRM-Inst-E achieves 71.97% accuracy, demonstrating strong performance with the training-free heuristic router.

Table 11: EDRM instance-level performance on Llama-3.1-8B-Instruct. EDRM-MLP achieves 72.27% accuracy with 149.9 tokens, outperforming CoT by +4.20% while reducing tokens by 46.0%. Direct, Standard, and CoT are baselines; the EDRM columns are ours.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Inst-E Acc (%) | EDRM-Inst-E Tok | EDRM-Inst-C $\overline{\mathrm{Acc}}$ (%) | EDRM-Inst-C $\overline{\mathrm{Tok}}$ | EDRM-MLP Acc (%) | EDRM-MLP Tok |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-C | 81.06 | 4.0 | 81.14 | 29.7 | 86.60 | 270.1 | 84.56 | 32.8 | 84.51 | 32.8 | 84.56 | 32.9 |
| ARC-E | 92.76 | 4.0 | 92.72 | 16.1 | 94.19 | 235.2 | 93.94 | 20.2 | 93.94 | 20.2 | 93.86 | 20.3 |
| BBH | 54.29 | 4.0 | 60.95 | 149.5 | 65.33 | 303.2 | 69.53 | 213.1 | 69.50 | 213.2 | 67.50 | 166.2 |
| CSQA | 74.04 | 4.0 | 71.99 | 47.2 | 74.61 | 230.1 | 75.51 | 56.5 | 75.53 | 56.5 | 76.00 | 46.7 |
| CH-Abd | 31.04 | 4.0 | 39.21 | 240.0 | 39.92 | 284.3 | 50.50 | 285.0 | 50.50 | 285.0 | 51.12 | 292.4 |
| CH-Ded | 36.33 | 4.0 | 54.12 | 230.3 | 56.04 | 240.1 | 53.21 | 222.8 | 53.21 | 222.4 | 56.17 | 253.9 |
| FOLIO | 48.42 | 4.0 | 52.24 | 245.5 | 56.98 | 337.7 | 63.95 | 323.2 | 63.90 | 323.0 | 61.79 | 291.5 |
| GPQA | 36.16 | 11.5 | 33.71 | 327.2 | 27.01 | 1005.8 | 43.08 | 363.4 | 43.16 | 362.6 | 41.96 | 423.3 |
| GSM8K | 14.86 | 16.8 | 84.46 | 227.2 | 82.64 | 234.8 | 76.57 | 245.4 | 76.54 | 245.2 | 82.71 | 278.6 |
| LSAT | 50.74 | 4.0 | 50.74 | 213.7 | 49.85 | 491.7 | 58.28 | 341.2 | 58.28 | 341.6 | 57.19 | 278.2 |
| MultiArith | 42.50 | 7.0 | 98.83 | 116.9 | 98.00 | 125.7 | 95.50 | 156.1 | 95.54 | 156.1 | 92.33 | 149.0 |
| MuSR | 51.85 | 4.0 | 50.26 | 165.6 | 51.32 | 291.7 | 56.08 | 124.0 | 56.10 | 124.6 | 58.33 | 203.9 |
| PIQA | 81.99 | 4.0 | 82.59 | 42.9 | 83.19 | 216.8 | 86.78 | 103.5 | 86.78 | 103.5 | 84.93 | 38.7 |
| SIQA | 70.32 | 4.0 | 70.01 | 48.3 | 70.68 | 214.4 | 71.65 | 45.0 | 71.66 | 45.0 | 72.82 | 58.8 |
| StratQA | 79.61 | 4.0 | 84.19 | 15.6 | 69.43 | 255.7 | 85.98 | 20.7 | 85.98 | 20.6 | 85.72 | 17.0 |
| Overall | 59.00 | 5.0 | 68.21 | 127.1 | 68.07 | 277.8 | 71.97 | 153.9 | 71.01 | 170.2 | 72.27 | 149.9 |

##### EDRM\-Instance with Qwen2\.5\-7B\-Instruct\.

Table [12](https://arxiv.org/html/2605.22873#A2.T12) presents the instance-level routing performance on Qwen2.5-7B-Instruct. EDRM-Inst-E achieves the best accuracy (78.11%) with 242.7 tokens, outperforming CoT (73.38% accuracy, 330.3 tokens) by +4.73 percentage points while reducing token consumption by 26.5%. EDRM-MLP achieves 77.93% accuracy with the best token efficiency (240.4 tokens) among instance-level variants.

Table 12: EDRM instance-level performance on Qwen2.5-7B-Instruct. EDRM-Inst-E achieves 78.11% accuracy, outperforming CoT by +4.73% while reducing tokens by 26.5%. Direct, Standard, and CoT are baselines; EDRM-Inst-E, EDRM-Inst-C, and EDRM-MLP are ours.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Inst-E Acc (%) | EDRM-Inst-E Tok | EDRM-Inst-C $\overline{\mathrm{Acc}}$ (%) | EDRM-Inst-C $\overline{\mathrm{Tok}}$ | EDRM-MLP Acc (%) | EDRM-MLP Tok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-C | 90.44 | 3.0 | 90.87 | 198.3 | 91.72 | 289.7 | 92.41 | 130.8 | 92.41 | 129.0 | 92.15 | 147.4 |
| ARC-E | 95.75 | 3.0 | 96.51 | 157.8 | 96.21 | 270.5 | 96.42 | 104.8 | 96.39 | 103.0 | 96.76 | 147.6 |
| BBH | 53.91 | 3.4 | 68.47 | 223.7 | 71.07 | 330.4 | 74.54 | 252.1 | 74.40 | 251.8 | 73.53 | 239.4 |
| CSQA | 81.08 | 3.0 | 81.00 | 120.6 | 80.84 | 252.3 | 81.82 | 93.3 | 81.82 | 92.2 | 81.90 | 89.0 |
| CH-Abd | 34.21 | 3.1 | 52.08 | 375.5 | 52.71 | 432.5 | 61.92 | 463.6 | 61.85 | 462.5 | 60.88 | 445.5 |
| CH-Ded | 46.92 | 3.0 | 53.21 | 337.2 | 52.88 | 365.8 | 65.25 | 377.9 | 65.02 | 373.9 | 64.79 | 366.9 |
| FOLIO | 53.82 | 3.0 | 67.61 | 356.5 | 69.44 | 398.4 | 74.00 | 410.2 | 73.82 | 409.3 | 74.50 | 419.4 |
| GPQA | 39.51 | 3.0 | 36.16 | 670.9 | 29.91 | 806.3 | 48.21 | 579.2 | 47.99 | 574.8 | 45.31 | 461.3 |
| GSM8K | 22.59 | 6.6 | 89.76 | 292.8 | 89.23 | 299.9 | 87.11 | 327.9 | 87.02 | 327.6 | 90.30 | 361.4 |
| LSAT | 61.35 | 3.1 | 56.39 | 492.1 | 57.68 | 545.9 | 68.68 | 464.8 | 68.68 | 464.6 | 65.91 | 333.5 |
| MultiArith | 61.17 | 5.8 | 98.83 | 174.7 | 98.83 | 182.3 | 97.67 | 210.9 | 97.67 | 210.9 | 99.17 | 247.0 |
| MuSR | 54.89 | 3.0 | 53.84 | 182.3 | 52.78 | 384.8 | 56.48 | 101.3 | 56.48 | 100.5 | 57.01 | 110.0 |
| PIQA | 86.62 | 3.4 | 82.43 | 119.4 | 87.32 | 235.5 | 87.70 | 105.1 | 87.69 | 104.8 | 88.14 | 125.6 |
| SIQA | 73.39 | 3.1 | 72.67 | 101.9 | 74.46 | 261.7 | 75.03 | 92.5 | 74.95 | 91.9 | 74.72 | 100.1 |
| StratQA | 85.81 | 3.0 | 80.83 | 151.4 | 77.95 | 259.7 | 87.47 | 133.7 | 87.47 | 132.5 | 86.90 | 124.0 |
| Overall | 64.63 | 3.4 | 72.89 | 240.6 | 73.38 | 330.3 | 78.11 | 242.7 | 76.91 | 255.3 | 77.93 | 240.4 |

##### EDRM\-Instance with Qwen3\-4B\-Instruct\-2507\.

Table [13](https://arxiv.org/html/2605.22873#A2.T13) presents the instance-level routing performance on the reasoning-enhanced Qwen3-4B-Instruct-2507. Despite the model’s inherent over-reasoning bias, EDRM effectively suppresses redundant deliberation. EDRM-MLP achieves 81.20% accuracy with 424.8 tokens, nearly matching CoT’s 81.35% accuracy while reducing token consumption by 33.9% (from 642.5 to 424.8 tokens). EDRM-Inst-E achieves 80.29% with 401.1 tokens, demonstrating a 37.6% token reduction.

Table 13: EDRM instance-level performance on Qwen3-4B-Instruct-2507. EDRM-MLP achieves 81.20% accuracy with 424.8 tokens, reducing token cost by 33.9% compared to CoT (642.5 tokens) while maintaining comparable accuracy. Direct, Standard, and CoT are baselines; EDRM-Inst-E, EDRM-Inst-C, and EDRM-MLP are ours.

| Dataset | Direct Acc (%) | Direct Tok | Standard Acc (%) | Standard Tok | CoT Acc (%) | CoT Tok | EDRM-Inst-E Acc (%) | EDRM-Inst-E Tok | EDRM-Inst-C $\overline{\mathrm{Acc}}$ (%) | EDRM-Inst-C $\overline{\mathrm{Tok}}$ | EDRM-MLP Acc (%) | EDRM-MLP Tok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-C | 90.78 | 6.3 | 94.37 | 254.8 | 94.45 | 416.2 | 92.15 | 114.3 | 92.16 | 123.3 | 93.60 | 198.2 |
| ARC-E | 96.09 | 5.9 | 97.98 | 190.1 | 98.32 | 344.9 | 96.55 | 96.4 | 96.71 | 100.0 | 97.39 | 150.2 |
| BBH | 57.23 | 8.2 | 82.93 | 556.4 | 84.09 | 664.6 | 77.82 | 403.2 | 78.12 | 409.1 | 82.45 | 490.0 |
| CSQA | 75.92 | 6.3 | 80.43 | 256.2 | 80.43 | 414.5 | 76.66 | 94.5 | 76.88 | 97.8 | 79.52 | 176.3 |
| CH-Abd | 29.96 | 8.1 | 65.00 | 869.9 | 63.83 | 943.2 | 65.75 | 887.9 | 65.99 | 891.0 | 63.00 | 754.2 |
| CH-Ded | 54.00 | 6.2 | 82.46 | 634.4 | 82.87 | 676.6 | 84.50 | 674.9 | 84.51 | 675.8 | 82.17 | 590.1 |
| FOLIO | 60.71 | 4.7 | 77.41 | 816.7 | 76.83 | 865.3 | 79.65 | 809.1 | 79.74 | 816.8 | 79.49 | 761.3 |
| GPQA | 38.17 | 6.5 | 48.88 | 2611.3 | 47.99 | 2762.7 | 51.56 | 1790.2 | 51.95 | 1827.0 | 56.25 | 1934.4 |
| GSM8K | 27.22 | 7.5 | 90.22 | 321.8 | 90.83 | 339.1 | 83.70 | 337.1 | 84.48 | 346.1 | 85.97 | 331.3 |
| LSAT | 66.70 | 4.8 | 82.16 | 1420.3 | 83.05 | 1550.7 | 78.39 | 863.7 | 78.46 | 877.7 | 80.48 | 998.3 |
| MultiArith | 81.83 | 5.7 | 97.83 | 107.9 | 97.83 | 129.5 | 95.17 | 142.6 | 95.17 | 142.6 | 96.17 | 132.5 |
| MuSR | 58.33 | 5.7 | 57.80 | 1059.3 | 62.83 | 1401.7 | 60.85 | 315.2 | 60.85 | 318.4 | 63.23 | 700.7 |
| PIQA | 83.35 | 5.0 | 84.71 | 167.9 | 88.14 | 425.4 | 84.28 | 122.6 | 84.28 | 122.8 | 85.42 | 144.5 |
| SIQA | 71.90 | 5.6 | 74.87 | 239.8 | 77.02 | 398.6 | 74.00 | 135.7 | 74.25 | 139.1 | 75.08 | 157.3 |
| StratQA | 81.09 | 5.1 | 77.82 | 206.5 | 74.50 | 326.1 | 81.83 | 121.5 | 81.83 | 124.4 | 83.10 | 148.3 |
| Overall | 65.63 | 6.2 | 80.96 | 518.4 | 81.35 | 642.5 | 80.29 | 401.1 | 79.03 | 467.5 | 81.20 | 424.8 |

##### Summary of cross\-model instance\-level performance\.

Across all four models, instance-level EDRM variants match or exceed CoT accuracy at substantially lower token cost. On base models (Llama-3.2-3B, Llama-3.1-8B, Qwen2.5-7B), EDRM achieves accuracy gains of +3.5% to +5.6% over CoT while reducing token consumption by 27–46%. On the reasoning-enhanced Qwen3-4B, EDRM maintains comparable accuracy (81.20% vs. 81.35%) while achieving 33–38% token reduction, demonstrating robust routing even when models are biased toward verbose deliberation.
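The headline comparisons against CoT can be reproduced from the Overall rows alone. The sketch below is illustrative rather than part of the EDRM pipeline: the values are copied from Tables 11–13, and the helper name `compare_to_cot` is our own. It computes the accuracy difference in percentage points and the relative token saving for any EDRM variant.

```python
# Overall rows copied from Tables 11-13: method -> (accuracy %, average tokens).
OVERALL = {
    "Llama-3.1-8B-Instruct": {
        "CoT": (68.07, 277.8),
        "EDRM-Inst-E": (71.97, 153.9),
        "EDRM-MLP": (72.27, 149.9),
    },
    "Qwen2.5-7B-Instruct": {
        "CoT": (73.38, 330.3),
        "EDRM-Inst-E": (78.11, 242.7),
        "EDRM-MLP": (77.93, 240.4),
    },
    "Qwen3-4B-Instruct-2507": {
        "CoT": (81.35, 642.5),
        "EDRM-Inst-E": (80.29, 401.1),
        "EDRM-MLP": (81.20, 424.8),
    },
}


def compare_to_cot(rows, method):
    """Return (accuracy gain in percentage points, token saving in %) vs. CoT."""
    cot_acc, cot_tok = rows["CoT"]
    acc, tok = rows[method]
    gain_pp = acc - cot_acc
    saving_pct = 100.0 * (cot_tok - tok) / cot_tok
    return round(gain_pp, 2), round(saving_pct, 1)


for model, rows in OVERALL.items():
    for method in ("EDRM-Inst-E", "EDRM-MLP"):
        gain, saving = compare_to_cot(rows, method)
        print(f"{model:24s} {method:12s} {gain:+.2f} pp  {saving:.1f}% fewer tokens")
```

Note that the accuracy difference is an absolute gap in percentage points, whereas the token saving is relative to CoT's token count.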

### B.6 Ablation Study on EDRM-MLP Variants

We conduct comprehensive ablation studies on EDRM-MLP across two design dimensions: (i) input representation (3D statistical descriptors, 64D entropy trajectory, or 67D hybrid concatenation), and (ii) label strategy (original multi-label vs. priority-constrained single-label). Table [14](https://arxiv.org/html/2605.22873#A2.T14) reports average accuracy and token consumption under the boost setting (routing decision applied) across 16 reasoning benchmarks.

Table 14: EDRM-MLP boost variants comparison (average accuracy and token consumption across 16 benchmarks). Each variant reports Acc/Token. Best accuracy per model is **bolded**; lowest token count is <u>underlined</u>.

| Model | ML-3D Acc | ML-3D Token | ML-64D Acc | ML-64D Token | ML-67D Acc | ML-67D Token | SL-3D Acc | SL-3D Token | SL-64D Acc | SL-64D Token | SL-67D Acc | SL-67D Token |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B | 0.6492 | 162.38 | **0.6664** | 178.89 | 0.6601 | 175.48 | 0.6574 | 171.45 | 0.6628 | 173.22 | 0.6484 | <u>145.29</u> |
| Llama-3.1-8B | 0.7327 | 158.17 | 0.7227 | 149.90 | 0.7212 | 138.14 | **0.7359** | 151.83 | 0.7307 | 134.73 | 0.7097 | <u>113.59</u> |
| Qwen2.5-7B | 0.7662 | 198.13 | 0.7793 | 240.45 | **0.7804** | 223.75 | 0.7731 | 188.21 | 0.7517 | 152.74 | 0.7486 | <u>146.02</u> |
| Avg. | 0.7160 | 172.89 | **0.7228** | 189.75 | 0.7206 | 179.12 | 0.7221 | 170.50 | 0.7151 | 153.56 | 0.7022 | <u>134.97</u> |
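To make the three input representations concrete, the following minimal sketch builds a 64D entropy trajectory, a 3D descriptor vector, and their 67D concatenation from per-step next-token logits. Several details are assumptions made for illustration only and are not specified in this section: the three descriptors (mean, linear slope, standard deviation), padding short trajectories by repeating the last value, and all function names.

```python
import math

N_STEPS = 64  # fixed trajectory length used in the ablation (N = 64)


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_64d(entropies):
    """Truncate to N_STEPS, or pad by repeating the last value (assumed scheme)."""
    if not entropies:
        raise ValueError("need at least one decoding step")
    traj = list(entropies[:N_STEPS])
    traj.extend([traj[-1]] * (N_STEPS - len(traj)))
    return traj


def descriptors_3d(traj):
    """ASSUMED descriptors: mean, least-squares slope, standard deviation."""
    n = len(traj)
    mean = sum(traj) / n
    t_mean = (n - 1) / 2.0
    cov = sum((t - t_mean) * (h - mean) for t, h in enumerate(traj))
    var_t = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var_t
    std = math.sqrt(sum((h - mean) ** 2 for h in traj) / n)
    return [mean, slope, std]


def hybrid_67d(traj):
    """Static concatenation of the 3D descriptors and the 64D trajectory."""
    return descriptors_3d(traj) + list(traj)


# Toy logits: five fairly flat steps followed by five sharply peaked steps.
step_logits = [[2.0, 0.5, 0.1, -1.0]] * 5 + [[6.0, 0.0, -1.0, -2.0]] * 5
traj = trajectory_64d([token_entropy(step) for step in step_logits])
features = hybrid_67d(traj)
print(len(traj), len(features))  # 64 67
```

In this toy input the entropy drops after the fifth step, so the assumed slope descriptor is negative, which is the kind of entropy-reduction signal the routing features are meant to capture.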

##### Key findings\.

- Input dimensionality exhibits model-dependent scaling. For mid-scale models (Qwen2.5-7B) under the multi-label target, richer representations outperform compact descriptors: ML-67D achieves the highest accuracy (0.7804), exceeding ML-3D (0.7662) by +1.42 percentage points. Conversely, for Llama-3.1-8B, the lightweight 3D input yields the best result (SL-3D: 0.7359), suggesting that stronger backbones can extract sufficient routing signals from statistical summaries alone. This reveals a capacity-efficiency trade-off: trajectory-level features benefit capacity-limited routers, while concise descriptors suffice for larger models.
- Label strategy interacts with model scale and input form. The priority-constrained single-label formulation (SL) yields more stable gains on smaller models, where simplifying the prediction target reduces optimization ambiguity. However, on Qwen2.5-7B with high-dimensional inputs (64D/67D), the original multi-label target (ML) outperforms SL (0.7793 vs. 0.7517 at 64D; 0.7804 vs. 0.7486 at 67D), implying that larger routers can effectively leverage richer supervision without confusion from label correlations.
- Hybrid 67D features do not universally dominate. While 67D achieves the overall best accuracy (Qwen2.5-7B + ML-67D: 0.7804), it underperforms 64D or 3D in several configurations (e.g., Llama-3.1-8B + SL-67D drops to 0.7097). We hypothesize that naively concatenating statistical and trajectory features may introduce redundancy or optimization interference for certain model-label combinations. Future work could explore adaptive feature gating or attention-based fusion to better integrate heterogeneous signals.
- Token efficiency favors single-label training. Across all models, SL-67D consistently achieves the lowest token consumption (avg. 134.97), while ML-64D incurs the highest average overhead (189.75). This suggests that single-label training encourages more decisive routing behavior, reducing unnecessary fallback to expensive decoding regimes.
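The difference between the two label strategies can be sketched as follows. This is a hypothetical construction, not the paper's exact recipe: we assume three candidate strategies ordered from cheapest to most expensive, that the priority constraint prefers the cheapest strategy that answers correctly, and that an instance no strategy solves falls back to the cheapest one. The names `STRATEGIES`, `multi_label_target`, and `single_label_target` are illustrative.

```python
# Assumed priority order: cheapest decoding strategy first.
STRATEGIES = ("Direct", "Standard", "CoT")


def multi_label_target(correct):
    """Multi-label (ML) target: one binary flag per strategy that is correct."""
    return [1 if correct[name] else 0 for name in STRATEGIES]


def single_label_target(correct, fallback="Direct"):
    """Single-label (SL) target: index of the highest-priority correct strategy.

    If no strategy is correct, return the index of `fallback` (an assumption:
    when nothing helps, spend the fewest tokens).
    """
    for index, name in enumerate(STRATEGIES):
        if correct[name]:
            return index
    return STRATEGIES.index(fallback)


# Example instance: only the two reasoning-style prompts get it right.
outcome = {"Direct": False, "Standard": True, "CoT": True}
print(multi_label_target(outcome))   # [0, 1, 1]
print(single_label_target(outcome))  # 1
```

Under this construction the single-label target collapses correlated labels (here, Standard and CoT both correct) into one class, which is one plausible reading of why SL routers commit more decisively to cheaper strategies.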

##### Practical recommendation\.

Considering both accuracy and computational efficiency, we recommend SL-3D as the default configuration for resource-constrained deployments: it achieves near-optimal average accuracy (0.7221) with moderate token overhead (170.50). For scenarios prioritizing absolute accuracy with mid-scale models (e.g., 7B class), ML-67D provides the highest ceiling (0.7804) at the cost of $\sim$30% additional tokens.
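One simple way to operationalize this kind of recommendation is to pick the cheapest variant whose accuracy lies within a tolerance of the best. The sketch below applies that rule to values copied from Table 14; the rule itself, the tolerance values, and the helper name `pick_variant` are illustrative assumptions rather than the selection procedure used in the paper.

```python
# variant -> (accuracy, average tokens), copied from Table 14.
TABLE14_AVG = {
    "ML-3D": (0.7160, 172.89),
    "ML-64D": (0.7228, 189.75),
    "ML-67D": (0.7206, 179.12),
    "SL-3D": (0.7221, 170.50),
    "SL-64D": (0.7151, 153.56),
    "SL-67D": (0.7022, 134.97),
}
TABLE14_QWEN25_7B = {
    "ML-3D": (0.7662, 198.13),
    "ML-64D": (0.7793, 240.45),
    "ML-67D": (0.7804, 223.75),
    "SL-3D": (0.7731, 188.21),
    "SL-64D": (0.7517, 152.74),
    "SL-67D": (0.7486, 146.02),
}


def pick_variant(results, tolerance):
    """Cheapest variant whose accuracy is within `tolerance` of the best."""
    best_acc = max(acc for acc, _ in results.values())
    eligible = {
        name: tokens
        for name, (acc, tokens) in results.items()
        if acc >= best_acc - tolerance
    }
    return min(eligible, key=eligible.get)


# Allowing a 0.1-point accuracy slack on the average row selects SL-3D.
print(pick_variant(TABLE14_AVG, tolerance=0.001))      # SL-3D
# Demanding the accuracy ceiling on Qwen2.5-7B selects ML-67D.
print(pick_variant(TABLE14_QWEN25_7B, tolerance=0.0))  # ML-67D
```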

##### Limitations and future directions\.

Our ablation focuses on fixed-length ($N = 64$) entropy trajectories and static feature concatenation. Future extensions could explore: (i) adaptive trajectory length selection based on instance difficulty; (ii) learnable feature fusion mechanisms (e.g., cross-attention between 3D and 64D branches); (iii) joint optimization of router and decoder to enable end-to-end adaptation of entropy dynamics.

### B.7 Hyper-parameter sensitivity of $k$ for $V_{\text{sp}}/a_{\text{vnr}}$ control

The following experiments are all conducted on EDRM-Global-E.

Table 15: Routing accuracy and token consumption with different $k$ values (weighted mean across 16 benchmarks). Each $k$ setting reports Accuracy/Avg. Tokens. Best accuracy per model is **bolded**; lowest token count is <u>underlined</u>.

| Model | $k = 0.10$ Acc | $k = 0.10$ Token | $k = 0.07$ Acc | $k = 0.07$ Token | $k = 0.05$ Acc | $k = 0.05$ Token | $k = 0.03$ Acc | $k = 0.03$ Token |
|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | 0.6872 | <u>123.88</u> | **0.6931** | 138.73 | 0.6906 | 178.60 | 0.6760 | 202.45 |
| Llama-3.2-3B | 0.6201 | 110.08 | 0.6181 | <u>107.68</u> | **0.6210** | 114.57 | **0.6210** | 114.57 |
| Qwen2.5-7B | 0.7390 | <u>183.76</u> | **0.7410** | 193.97 | 0.7398 | 196.60 | 0.7398 | 196.60 |
| Qwen3-4B | 0.7791 | <u>351.42</u> | 0.7794 | 352.40 | **0.7797** | 355.90 | 0.7793 | 361.61 |
| Avg. | 0.7064 | <u>192.29</u> | **0.7079** | 198.20 | 0.7078 | 211.42 | 0.7040 | 218.81 |

Table [15](https://arxiv.org/html/2605.22873#A2.T15) shows the trade-off between routing accuracy and token efficiency across four $k$ thresholds. Two key patterns emerge:

(1) Accuracy peaks at moderate $k$. Across all models, routing accuracy is maximized at $k \in \{0.07, 0.05\}$ (avg. 0.7079/0.7078), while extreme values ($k = 0.10$ or $0.03$) yield slight degradation. This suggests that moderate routing sensitivity best balances selectivity (routing only high-confidence instances) and coverage (leveraging routing benefits broadly).

(2) Token savings scale monotonically with $k$ on average. Smaller $k$ values trigger routing more conservatively, resulting in higher token consumption (e.g., avg. 218.8 tokens at $k = 0.03$ vs. 192.3 at $k = 0.10$). However, the accuracy gain from $k = 0.10 \rightarrow 0.07$ (+0.15 percentage points) outweighs the modest token increase (+5.9 tokens), justifying our default choice of $k = 0.07$.
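The marginal trade-off between adjacent $k$ settings can be checked directly from the Avg. row of Table 15. The short sketch below is illustrative only (the helper name `marginal_change` is ours): it reports the change in accuracy, in percentage points, and in average tokens when moving from one threshold to the next.

```python
# (k, average accuracy, average tokens) copied from the Avg. row of Table 15,
# ordered from the largest to the smallest threshold.
AVG_BY_K = [
    (0.10, 0.7064, 192.29),
    (0.07, 0.7079, 198.20),
    (0.05, 0.7078, 211.42),
    (0.03, 0.7040, 218.81),
]


def marginal_change(row_from, row_to):
    """Return (accuracy change in percentage points, token change)."""
    _, acc_from, tok_from = row_from
    _, acc_to, tok_to = row_to
    return round(100.0 * (acc_to - acc_from), 2), round(tok_to - tok_from, 2)


for prev, curr in zip(AVG_BY_K, AVG_BY_K[1:]):
    d_acc, d_tok = marginal_change(prev, curr)
    print(f"k {prev[0]:.2f} -> {curr[0]:.2f}: {d_acc:+.2f} pp, {d_tok:+.2f} tokens")

# Average token consumption rises at every step as k decreases.
tokens = [tok for _, _, tok in AVG_BY_K]
print(all(a < b for a, b in zip(tokens, tokens[1:])))  # True
```

Only the first step ($k = 0.10 \rightarrow 0.07$) buys accuracy on average; the later steps add tokens while average accuracy falls.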

Model\-specific insights:

- Smaller models (3B–7B) benefit more from routing: accuracy improves by +0.7–1.3% with 40–55% token reduction.
- Larger models (4B+) show diminishing accuracy returns but still achieve $\sim$45% token savings, making routing valuable for latency-sensitive deployment.
