HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

arXiv cs.AI Papers

Summary

This paper proposes HyperGuide, a method that distills reasoning progress into a hyperbolic geometric signal to guide step-by-step generation in LLMs, improving multi-step reasoning efficiency without explicit tree search.

arXiv:2605.24140v1 Announce Type: new Abstract: Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
Original Article
View Cached Full Text

Cached at: 05/26/26, 09:05 AM

# HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
Source: [https://arxiv.org/html/2605.24140](https://arxiv.org/html/2605.24140)
Yuyu Liu Department of Computer Science Stony Brook University &Haotian Xu Department of Applied Mathematics and Statistics Stony Brook University &Yanan He Department of Computer Science Yale University &Sarang Rajendra Patil Department of Data Science New Jersey Institute of Technology &Mengjia Xu Department of Data Science New Jersey Institute of Technology &Tengfei Ma Department of Biomedical Informatics Stony Brook University

###### Abstract

Multi\-step reasoning remains a central challenge for large language models: single\-pass generation is efficient but lacks accuracy; tree\-search methods explore multiple paths but are computation\-heavy\. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step\-by\-step generation\. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution\-bearing states are few while dead ends are exponentially numerous\. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance\-to\-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations\. We train a lightweight head to project LLM hidden states into this space, then fine\-tune a low\-rank adapter interactively on its own reasoning attempts to act on the injected signal\. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains\. Our code is publicly available at[https://github\.com/yuyuliu11037/HyperGuide](https://github.com/yuyuliu11037/HyperGuide)\.

*K*eywordsHyperbolic Embeddings⋅\\cdotLLM Reasoning⋅\\cdotImitation Learning

## 1Introduction

Large language models \(LLMs\) have emerged as general\-purpose problem solvers, demonstrating broad competence on tasks ranging from mathematical reasoning to code synthesis and long\-horizon planningBrownet al\.\([2020](https://arxiv.org/html/2605.24140#bib.bib36)\); Lightmanet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib179)\); Jianget al\.\([2026](https://arxiv.org/html/2605.24140#bib.bib77)\); Valmeekamet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib33)\)\. A common thread underlying many of these advances is*multi\-step reasoning*: composing sequences of intermediate inferences to reach conclusions that no single forward pass could produce\. Reliably and efficiently producing such chains remains a central challenge\. Single\-pass methods such as chain\-of\-thought promptingWeiet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib187)\)are cheap but yield low accuracy; tree search methods such as Tree of ThoughtsYaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib186)\)and Reasoning via PlanningHaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib70)\)improve performance by exploring multiple reasoning paths, but require many LLM forward passes\.

This tradeoff, however, is not intrinsic\. The accuracy advantage of tree search can be largely attributed to one kind of information that single\-pass generation typically lacks: estimation of distance from reasoning state to the correct solution\. Here, a*state*denotes a node in the reasoning tree, i\.e\., a partial reasoning trajectory\. The key question is whether this proximity signal can be injected directly into single\-pass generation, sparing the cost of explicit search\. We argue that it can, because the state distribution possesses an asymmetric structure that makes proximity information geometrically easy to encode: a small number of productive states lie on paths leading to correct solutions, and each typically branches into multiple solution paths\. The vast majority of states are dead ends from which no sequence of operations reaches the goal\. For example,99\.4%99\.4\\%of terminal leaves in Game\-of\-24 search trees are dead ends, and70\.9%70\.9\\%of ProntoQA rule applications fail to advance toward the target conclusion\.111Per\-task statistics for all benchmarks are reported in Appendix Table[7](https://arxiv.org/html/2605.24140#A2.T7)\.

Hyperbolic space is a natural fit for this asymmetry\. In this geometry, volume grows exponentially toward the boundaryNickel and Kiela \([2017](https://arxiv.org/html/2605.24140#bib.bib364),[2018](https://arxiv.org/html/2605.24140#bib.bib290)\): the region near the origin is compact while the periphery provides exponentially expanding capacity\. This matches the cardinality structure of the reachable states: the few solution\-bearing states need only the small central region, while the exponentially many dead ends require the boundary volume for adequate separation\. Distance\-to\-origin then serves directly as a continuous proxy for solution proximity\. At the same time, the exponential surface area at each radius provides angular capacity to separate structurally distinct branches, so states with similar proximity but different next\-step requirements remain distinguishable\.

Building on this observation, we propose a pipeline that separates learning the geometric signal from learning to act on it\. In the first stage, we train a lightweight projection head that maps the frozen LLM’s hidden states into hyperbolic space, so that the geometric signal is meaningful\. In the second stage, we fine\-tune a low\-rank adapter to select next\-step operations conditioned on the injected signal, training interactively on the adapter’s own reasoning attempts, so that it learns to use the signal at the states it will actually encounter during generation\. At inference time, each step boundary incurs only the cost of one forward pass through the small projection head\.

We evaluate on a suite of reasoning benchmarks spanning arithmetic, classical planning, constraint satisfaction, and multi\-hop deductive logic\. Because the LoRA adapter is task\-agnostic, a single adapter transfers across related tasks when paired with a cheaply retrained task\-specific head\. The main contributions of our paper are threefold:

1. 1\.We identify a structural correspondence between the solution\-proximity and hyperbolic geometry, and show that both can be encoded as a single geometric primitive injected directly into the language model’s generation stream at each reasoning step\.
2. 2\.We propose a two\-stage pipeline that teaches the model to act on the injected geometric signal during single\-pass generation without invoking search at inference time\.
3. 3\.Across multiple benchmarks and three open\-weight backbones, our method delivers consistent accuracy gains at substantially lower inference cost than search\-based baselines, with larger improvements on deeper reasoning chains\.

## 2Related Work

### 2\.1LLM Reasoning

Single\-pass prompting methods such as Chain\-of\-ThoughtWeiet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib187)\), Self\-ConsistencyWanget al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib185)\), and Least\-to\-MostZhouet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib184)\)cannot revise early mistakes, while search\-based methods such as Tree of ThoughtsYaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib186)\)and Graph of ThoughtsBestaet al\.\([2024](https://arxiv.org/html/2605.24140#bib.bib53)\)recover lookahead at substantial inference cost\. A parallel line scores intermediate steps with learned verifiersLightmanet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib179)\); Wanget al\.\([2024a](https://arxiv.org/html/2605.24140#bib.bib52)\); Uesatoet al\.\([2022](https://arxiv.org/html/2605.24140#bib.bib51)\)to guide beam or best\-of\-nndecoding, still requiring a separate scoring head and multiple candidate expansions, while reasoning\-tuned models such as DeepSeek\-R1DeepSeek\-AIet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib73)\)instead internalise long reasoning traces via reinforcement learning\. A separate line bypasses discrete decoding and reasons in a continuous latent space: CoconutHaoet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib20)\)feeds the last hidden state back as the next input embedding, CODIShenet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib21)\)compresses explicit CoT into continuous thoughts via self\-distillation, and SoftCoTXuet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib22)\)injects soft thought tokens from a fixed assistant model into the LLM’s representation space\. We target the same bottleneck but distil the search tree’s*state distribution*as a geometric signal the model consults in\-line, with no separate reward model or multi\-candidate expansion at inference\.

### 2\.2Search Distillation

Distilling planning into a reactive policy has a long history in reinforcement learning: the AlphaGo familySilveret al\.\([2016](https://arxiv.org/html/2605.24140#bib.bib50),[2017a](https://arxiv.org/html/2605.24140#bib.bib48),[2017b](https://arxiv.org/html/2605.24140#bib.bib49)\); Schrittwieseret al\.\([2020](https://arxiv.org/html/2605.24140#bib.bib47)\)trains policy and value networks to mimic Monte Carlo tree search, with Expert IterationAnthonyet al\.\([2017](https://arxiv.org/html/2605.24140#bib.bib46)\)and policy distillationRusuet al\.\([2016](https://arxiv.org/html/2605.24140#bib.bib45)\)formalising the iterative recipe; we adopt DAggerRosset al\.\([2011](https://arxiv.org/html/2605.24140#bib.bib54)\)in Stage 2 for its statistical guarantees under the distilled policy’s own state distribution\. In the LLM setting, value\-guided MCTS variantsLiuet al\.\([2024](https://arxiv.org/html/2605.24140#bib.bib44)\); Tianet al\.\([2024](https://arxiv.org/html/2605.24140#bib.bib43)\); Zhanget al\.\([2024](https://arxiv.org/html/2605.24140#bib.bib42)\)distil rollouts via SFT or RL using a separate value head or preference model\. Our distilled signal is not a learned scalar but a geometric quantity read off a hyperbolic embedding of the backbone’s own hidden state, providing a principled geometric prior for the asymmetric reward structure of reasoning trees\.

### 2\.3Hyperbolic Representations

Hyperbolic space embeds hierarchical or tree\-like data with low distortion because ball volume grows exponentially with radius, matching tree branching, as demonstrated by PoincaréNickel and Kiela \([2017](https://arxiv.org/html/2605.24140#bib.bib364)\)and LorentzNickel and Kiela \([2018](https://arxiv.org/html/2605.24140#bib.bib290)\)embeddings with theoretical guarantees inSaet al\.\([2018](https://arxiv.org/html/2605.24140#bib.bib355)\)\. A line of work generalises standard neural\-network layers to the Poincaré ball and other Riemannian manifoldsOctavian\-Eugen Ganeaet al\.\([2018](https://arxiv.org/html/2605.24140#bib.bib105)\); Shimizuet al\.\([2021](https://arxiv.org/html/2605.24140#bib.bib41)\); Chamiet al\.\([2019](https://arxiv.org/html/2605.24140#bib.bib419)\); Chenet al\.\([2022](https://arxiv.org/html/2605.24140#bib.bib418)\), with applications spanning visionKhrulkovet al\.\([2020](https://arxiv.org/html/2605.24140#bib.bib171)\), languageGaneaet al\.\([2018](https://arxiv.org/html/2605.24140#bib.bib363)\), and recent transformer/LLM injectionsYanget al\.\([2026](https://arxiv.org/html/2605.24140#bib.bib40)\)\.Raj \([2026](https://arxiv.org/html/2605.24140#bib.bib25)\)probe frozen LLM hidden states and show hyperbolic classifiers recover reasoning\-relevant hierarchy more robustly than Euclidean ones, but use the geometry only diagnostically\. To our knowledge, hyperbolic space has not previously been used to represent the*state distribution of a reasoning search tree*, nor has distance\-to\-origin been treated as an intrinsic proxy for solution\-proximity in a reasoning system\.

## 3Methodology

![Refer to caption](https://arxiv.org/html/2605.24140v1/pipeline.drawio.png)Figure 1:Architecture overview\.Stage 1 \(Top\):the projection headhϕh\_\{\\phi\}embeds reasoning\-tree states into the Poincaré ball𝔻cn\\mathbb\{D\}^\{n\}\_\{c\}so that distance\-to\-origin tracks distance\-to\-solution and pairwise geodesic distance tracks tree distance\.Stage 2 \(Bottom\):withfθf\_\{\\theta\}andhϕh\_\{\\phi\}frozen, each statests\_\{t\}is encoded to𝐳t\\mathbf\{z\}\_\{t\}and lifted bygψg\_\{\\psi\}into a virtual token spliced into the residual stream before stept\+1t\{\+\}1\. A LoRA adapter is trained on the model’s own rollouts, with a tree oracle providing the target operation at each state\.### 3\.1Preliminaries

##### Task setup\.

We define multi\-step reasoning as a deterministic, finite\-horizon decision process over text statesYaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib186)\); Haoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib70)\): from an initial states0s\_\{0\}and goalgg, admissible single\-step operations𝒜​\(s\)\\mathcal\{A\}\(s\)generate a search tree𝒯s0,g\\mathcal\{T\}\_\{s\_\{0\},g\}via a deterministic transitionδ\\delta\. We writed​\(s\)d\(s\)for the*distance\-to\-solution*of statess, defined as the minimum BFS edge distance fromssto a successful leaf in𝒯s0,g\\mathcal\{T\}\_\{s\_\{0\},g\}, and∞\\inftywhen no successful trace passes throughss\. Finite distances mark the relatively few solution\-bearing states while∞\\inftymarks the exponentially many dead ends, an asymmetry that motivates the hyperbolic formulation below\.

##### Poincaré ball\.

We work in the Poincaré ball of curvature−c\-c\(c\>0c\{\>\}0\),𝔻cn=\{𝐱∈ℝn:c​‖𝐱‖2<1\}\\mathbb\{D\}^\{n\}\_\{c\}=\\\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{n\}:c\\\|\\mathbf\{x\}\\\|^\{2\}<1\\\}, with geodesic distance

d𝔻​\(𝐱,𝐲\)=1c​cosh−1⁡\(1\+2​c​‖𝐱−𝐲‖2\(1−c​‖𝐱‖2\)​\(1−c​‖𝐲‖2\)\)\.d\_\{\\mathbb\{D\}\}\(\\mathbf\{x\},\\mathbf\{y\}\)=\\tfrac\{1\}\{\\sqrt\{c\}\}\\,\\cosh^\{\-1\}\\\!\\Big\(1\+\\tfrac\{2c\\,\\\|\\mathbf\{x\}\-\\mathbf\{y\}\\\|^\{2\}\}\{\(1\-c\\\|\\mathbf\{x\}\\\|^\{2\}\)\(1\-c\\\|\\mathbf\{y\}\\\|^\{2\}\)\}\\Big\)\.\(1\)Ball volume in𝔻cn\\mathbb\{D\}^\{n\}\_\{c\}grows exponentially with radius, the geometric reason hyperbolic space embeds trees with low distortionNickel and Kiela \([2017](https://arxiv.org/html/2605.24140#bib.bib364)\); Saet al\.\([2018](https://arxiv.org/html/2605.24140#bib.bib355)\)\. The distance to the origin simplifies tod𝔻​\(𝟎,𝐱\)=\(2/c\)​tanh−1⁡\(c​‖𝐱‖\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{0\},\\mathbf\{x\}\)=\(2/\\sqrt\{c\}\)\\,\\tanh^\{\-1\}\\\!\\big\(\\sqrt\{c\}\\,\\\|\\mathbf\{x\}\\\|\\big\), a monotone function of‖𝐱‖\\\|\\mathbf\{x\}\\\|that diverges as𝐱\\mathbf\{x\}approaches the boundary\. To lift Euclidean hidden states onto the manifold we use the exponential map at the originOctavian\-Eugen Ganeaet al\.\([2018](https://arxiv.org/html/2605.24140#bib.bib105)\),exp𝟎c⁡\(𝐯\)=tanh⁡\(c​‖𝐯‖\)​𝐯/\(c​‖𝐯‖\)\\exp^\{c\}\_\{\\mathbf\{0\}\}\(\\mathbf\{v\}\)=\\tanh\\\!\\big\(\\sqrt\{c\}\\,\\\|\\mathbf\{v\}\\\|\\big\)\\,\\mathbf\{v\}/\\big\(\\sqrt\{c\}\\,\\\|\\mathbf\{v\}\\\|\\big\)\.

### 3\.2Training Pipeline

Figure[1](https://arxiv.org/html/2605.24140#S3.F1)summarises the end\-to\-end pipeline\. Our method factors the problem into two questions:*what*information to surface at each reasoning boundary, and*how*the model should consume it\.Stage 1answers the*what*question by training the projection headhϕh\_\{\\phi\}to map each state into a meaningful location in Poincaré ball; with this geometry in place,Stage 2answers the*how*problem by training a LoRA adapter which acts on𝐳t\\mathbf\{z\}\_\{t\}\.

##### Stage 1: Hyperbolic Space Construction \(Figure[1](https://arxiv.org/html/2605.24140#S3.F1)top\)\.

Since the reasoning process resonates hyperbolic space through a unique structure that solution\-bearing states are few yet connected while dead ends are many but isolated, we intend to model such a property by utilizing a Poincaré ball\. To give the base model a meaningful geometric signal to act on, we first train the projection headhϕ:ℝd→𝔻cnh\_\{\\phi\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{D\}^\{n\}\_\{c\}alone using supervision from the enumerated reasoning tree\. The vector produced byhϕh\_\{\\phi\}should carry two complementary kinds of information: the scalar distance\-to\-solution, and the structural relationships between states\. We achieve this by minimising a weighted sum of two losses:

ℒStage​1​\(ϕ\)=ℒrank​\(ϕ\)\+λ​ℒmetric​\(ϕ\),\\mathcal\{L\}\_\{\\mathrm\{Stage\\,1\}\}\(\\phi\)\\;=\\;\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\(\\phi\)\\;\+\\;\\lambda\\,\\mathcal\{L\}\_\{\\mathrm\{metric\}\}\(\\phi\),\(2\)
Radial ranking loss\.This loss ensures that the distance from the origin in the Poincaré ball tracks the distance\-to\-solution of each state\. We sample state pairs\(si,sj\)\(s\_\{i\},s\_\{j\}\)from the same tree withd​\(si\)<d​\(sj\)d\(s\_\{i\}\)<d\(s\_\{j\}\)and minimise the margin hinge

ℒrank​\(ϕ\)=𝔼d​\(si\)<d​\(sj\)​\[max⁡\(0,d𝔻​\(𝟎,𝐳i\)−d𝔻​\(𝟎,𝐳j\)\+γ\)\],\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\(\\phi\)\\;=\\;\\mathbb\{E\}\_\{d\(s\_\{i\}\)<d\(s\_\{j\}\)\}\\Big\[\\max\\\!\\big\(0,\\;d\_\{\\mathbb\{D\}\}\(\\mathbf\{0\},\\mathbf\{z\}\_\{i\}\)\-d\_\{\\mathbb\{D\}\}\(\\mathbf\{0\},\\mathbf\{z\}\_\{j\}\)\+\\gamma\\big\)\\Big\],\(3\)wherefθf\_\{\\theta\}is the frozen pretrained LLM backbone mapping a state to itsℝd\\mathbb\{R\}^\{d\}hidden representation,𝐳i=hϕ​\(fθ​\(si\)\)\\mathbf\{z\}\_\{i\}=h\_\{\\phi\}\(f\_\{\\theta\}\(s\_\{i\}\)\)is the resulting Poincaré embedding \(and analogously𝐳j\\mathbf\{z\}\_\{j\}\), andγ\>0\\gamma\>0is a fixed margin\. This term ensures that the scalar summaryd𝔻​\(𝟎,hϕ​\(fθ​\(s\)\)\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{0\},h\_\{\\phi\}\(f\_\{\\theta\}\(s\)\)\)serves as a monotone proxy for the solution proximity of statess: promising states \(smalld​\(s\)d\(s\)\) are pulled toward the origin while dead ends \(larged​\(s\)d\(s\)\) are pushed toward the boundary\.

Metric preservation loss\.This loss ensures that the fullnn\-dimensional embedding preserves the structural relationships of the reasoning tree\. Letd𝒯​\(si,sj\)d\_\{\\mathcal\{T\}\}\(s\_\{i\},s\_\{j\}\)denote the shortest\-path distance between statessis\_\{i\}andsjs\_\{j\}in𝒯s0,g\\mathcal\{T\}\_\{s\_\{0\},g\}\. We sample state triplets\(si,sj,sk\)\(s\_\{i\},s\_\{j\},s\_\{k\}\)from the same tree such thatd𝒯​\(si,sj\)<d𝒯​\(si,sk\)d\_\{\\mathcal\{T\}\}\(s\_\{i\},s\_\{j\}\)<d\_\{\\mathcal\{T\}\}\(s\_\{i\},s\_\{k\}\)and minimise

ℒmetric​\(ϕ\)=𝔼d𝒯​\(si,sj\)<d𝒯​\(si,sk\)​\[max⁡\(0,d𝔻​\(𝐳i,𝐳j\)−d𝔻​\(𝐳i,𝐳k\)\+γ′\)\],\\mathcal\{L\}\_\{\\mathrm\{metric\}\}\(\\phi\)\\;=\\;\\mathbb\{E\}\_\{d\_\{\\mathcal\{T\}\}\(s\_\{i\},s\_\{j\}\)<d\_\{\\mathcal\{T\}\}\(s\_\{i\},s\_\{k\}\)\}\\Big\[\\max\\\!\\big\(0,\\;d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\mathbf\{z\}\_\{j\}\)\-d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\mathbf\{z\}\_\{k\}\)\+\\gamma^\{\\prime\}\\big\)\\Big\],\(4\)whered𝔻​\(𝐳i,𝐳j\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\mathbf\{z\}\_\{j\}\)is the geodesic distance in the Poincaré ball \(Equation[1](https://arxiv.org/html/2605.24140#S3.E1)\),γ′\>0\\gamma^\{\\prime\}\>0is a margin, and the tree distances are precomputed by BFS over the enumerated tree\. Because the geodesic depends on both norms and the relative angle between𝐳i\\mathbf\{z\}\_\{i\}and𝐳j\\mathbf\{z\}\_\{j\}\(via the cross\-ratio in Equation[1](https://arxiv.org/html/2605.24140#S3.E1)\),ℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}sends gradients through the angular coordinates thatℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}leaves unsupervised\.

##### Monte\-Carlo variant for non\-enumerable trees\.

When the full state tree is unavailable \(e\.g\. competition mathematics, used in our MATH experiments\), both loss terms are approximated from sampled rollouts:d​\(s\)d\(s\)is replaced by a Monte\-Carlo success\-rate estimated^​\(s\)\\hat\{d\}\(s\)and pairwise tree distances are read off shared rollout prefixes, with full details in Appendix[C\.2](https://arxiv.org/html/2605.24140#A3.SS2)\.

##### Stage 2: Hyperbolic Space Adaption \(Figure[1](https://arxiv.org/html/2605.24140#S3.F1)bottom\)\.

Stage 1 yields a hyperbolic embedding𝐳t=hϕ​\(fθ​\(st\)\)∈𝔻cn\\mathbf\{z\}\_\{t\}=h\_\{\\phi\}\(f\_\{\\theta\}\(s\_\{t\}\)\)\\in\\mathbb\{D\}^\{n\}\_\{c\}for every statests\_\{t\}\. Stage 2 teaches the base model to consume this embedding\. At each step boundarytt, a small up\-projectorgψ:ℝn→ℝdg\_\{\\psi\}:\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{d\}lifts𝐳t\\mathbf\{z\}\_\{t\}into the backbone’s hidden space, andgψ​\(𝐳t\)g\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)is spliced into the residual stream as the input embedding of one extra position immediately before the tokens of stept\+1t\{\+\}1; the transformer attends to it through all layers exactly as it would a normal token\. A low\-rank adapterΔ​θ\\Delta\\thetaon the attention projections offθf\_\{\\theta\}is trained to make use of this vector\.

To avoid limited supervision signal from teacher\-forcing training paradigm, we adopt DAggerRosset al\.\([2011](https://arxiv.org/html/2605.24140#bib.bib54)\)as our training algorithm\. The expert is the closed\-form tree oracle

𝒪​\(s\)=\{a∈𝒜​\(s\):d​\(δ​\(s,a\)\)<∞\},\\mathcal\{O\}\(s\)\\;=\\;\\big\\\{\\,a\\in\\mathcal\{A\}\(s\)\\,:\\,d\(\\delta\(s,a\)\)<\\infty\\,\\big\\\},\(5\)i\.e\. the set of single\-step operations whose resulting state still admits a path to the target\. Each epoch alternates a*rollout phase*which samples a trajectory under the current policy, encodingst↦𝐳ts\_\{t\}\\mapsto\\mathbf\{z\}\_\{t\}at every step boundary, then an*update phase*that for each reachedsts\_\{t\}with𝒪​\(st\)≠∅\\mathcal\{O\}\(s\_\{t\}\)\\neq\\varnothing, selects one winning operationat⋆∈𝒪​\(st\)a^\{\\star\}\_\{t\}\\in\\mathcal\{O\}\(s\_\{t\}\)by deterministic lexicographic tiebreak and minimises

ℒDAgger​\(Δ​θ,ψ\)=−∑tlog⁡pθ\+Δ​θ,ψ​\(at⋆\|promptt,gψ​\(𝐳t\)\),\\mathcal\{L\}\_\{\\mathrm\{DAgger\}\}\(\\Delta\\theta,\\psi\)\\;=\\;\-\\sum\_\{t\}\\;\\log p\_\{\\theta\+\\Delta\\theta,\\,\\psi\}\\\!\\Big\(\\,a^\{\\star\}\_\{t\}\\,\\Big\|\\,\\text\{prompt\}\_\{t\},\\;g\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)\\,\\Big\),\(6\)with gradients flowing only intoΔ​θ\\Delta\\thetaandgψg\_\{\\psi\}\. Rollout details are deferred to Appendix[C](https://arxiv.org/html/2605.24140#A3)\.

### 3\.3Extension to Task\-Agnostic Training

The pipeline of Section[3\.2](https://arxiv.org/html/2605.24140#S3.SS2)is made task\-agnostic by replacing each training problem’s single fixed objective with a set of*\(context, goal\)*pairs sampled from every internal node of its enumerated reasoning tree, where the goal at a node is any terminal value reachable from it\. A single group\-level LoRA adapter trained on this augmented data can then be paired with a cheaply retrained task\-specific projection head and transferred to structurally related tasks that share the same reasoning\-tree motif; for non\-enumerable tasks \(e\.g\. MATH\), the head is retrained with the Monte\-Carlo variants ofℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}andℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}from Section[3\.2](https://arxiv.org/html/2605.24140#S3.SS2)\. The full augmentation procedure is detailed in Appendix[B\.2](https://arxiv.org/html/2605.24140#A2.SS2)\.

### 3\.4Inference

At deployment, generation follows the same loop described in the rollout phase of Stage 2: before each step, the current state is encoded by the frozen backbone, mapped to a hyperbolic point then lifted by the up\-projector, and spliced into the residual stream as a single virtual token\. The total cost is one greedy decode plusO​\(1\)O\(1\)extra work per step boundary\.

## 4Experiments

### 4\.1Experimental Settings

##### Tasks and Datasets\.

We organise our evaluation into two groups of tasks\. Group A tasks \(Game of 24Yaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib186)\), N\-Queens, BlocksworldValmeekamet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib33)\), Graph ColoringHeyman and Zylberberg \([2025](https://arxiv.org/html/2605.24140#bib.bib29)\)\) share a*state\-reduction*motif: each step consumes elements from a finite pool via a locally compositional binary operation, yielding trees of fixed depth with monotonically decreasing branching factor\. Group B tasks \(Rule\-chaining, ProntoQASaparov and He \([2023](https://arxiv.org/html/2605.24140#bib.bib35)\), ProofWriterTafjordet al\.\([2021](https://arxiv.org/html/2605.24140#bib.bib27)\),MATHLightmanet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib179)\)\) share a*state\-expansion*motif: each step derives a new fact by applying an inference rule to existing premises, yielding chain\-like trees with monotonically growing state\. We summarize the test benchmarks in Table[1](https://arxiv.org/html/2605.24140#S4.T1)\. More details on the datasets are in Appendix[B](https://arxiv.org/html/2605.24140#A2)\.

Table 1:Task categorisation with the domain and the sampled test\-set size used in our evaluation\.Task TypeDomainDataset NameDescriptionTest Data SizeGroup AArithmeticGame of 24Reach 24 with four operands100Constraint satisfactionN\-Queens \(N=8N\{=\}8\)Place 8 non\-attacking queens81Symbolic planningBlocksworldBlock\-stacking planning350Constraint satisfactionGraph Coloringkk\-colour adjacent vertices differ500Group BForward chainingRule\-chainingHorn\-clause forward chaining600First\-order reasoningProntoQAFO reasoning over ontologies800First\-order logicProofWriterPremise\-conclusion validity500Competition mathMATHCompetition\-level math problems500
##### Base Models\.

We instantiate our pipeline on three open\-weight backbones drawn from different families and scales:Qwen2\.5\-14B\-InstructQwenet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib32)\),GPT\-OSS\-20BOpenAIet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib31)\)run in its*no\-thinking*mode so that inference is a standard single forward pass, andMistral\-Small\-3\.2\-24BMistral AI \([2025](https://arxiv.org/html/2605.24140#bib.bib30)\)\.

##### Baselines\.

We compare against six baselines spanning the accuracy–compute frontier: two single\-pass prompting methods,Few\-shotBrownet al\.\([2020](https://arxiv.org/html/2605.24140#bib.bib36)\)andSelf\-Consistency \(SC\)Wanget al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib185)\); a tree\-search methodTree of Thoughts \(ToT\)Yaoet al\.\([2023](https://arxiv.org/html/2605.24140#bib.bib186)\); a value\-guided search method with a learned value model,OVMYuet al\.\([2024](https://arxiv.org/html/2605.24140#bib.bib28)\); and two fine\-tuning baselines that augment reasoning with auxiliary tokens,PT\-SFTWanget al\.\([2024b](https://arxiv.org/html/2605.24140#bib.bib107)\)andSoftCoTXuet al\.\([2025](https://arxiv.org/html/2605.24140#bib.bib22)\)\.

Training details are documented in Appendix[C](https://arxiv.org/html/2605.24140#A3)and hyperparameter settings in Appendix[B\.4](https://arxiv.org/html/2605.24140#A2.SS4)\.

### 4\.2Main Results

Table[2](https://arxiv.org/html/2605.24140#S4.T2)reports in\-domain accuracy and Table[3](https://arxiv.org/html/2605.24140#S4.T3)reports out\-of\-domain transfer\. MATH has no enumerable derivation tree, so exact distance\-to\-solution is unavailable for per\-dataset in\-domain training; we therefore omit it from Table[2](https://arxiv.org/html/2605.24140#S4.T2)and report it only in Table[3](https://arxiv.org/html/2605.24140#S4.T3)under the Monte\-Carlo head variant of Section[3\.3](https://arxiv.org/html/2605.24140#S3.SS3)\. We highlight two summary observations\.

##### In\-domain Accuracy\.

Across all seven in\-domain datasets under per\-dataset training, HyperGuide outperforms most of the baselines while decoding in a single pass\. Tree of Thoughts underperforms few\-shot prompting on several tasks \(Graph Coloring, ProntoQA under Qwen2\.5\)\. The common failure mode is the self\-evaluation value scorer: when the base model cannot reliably rank partial solutions, the search budget is spent on unpromising branches, and the branching overhead actively hurts\. This is consistent with known limitations of prompt\-based ToT scoring; a learned verifier would likely close part of the gap, but would also add a training cost comparable to OVM, shifting the comparison to a different point on the accuracy–compute frontier\. Full implementation details, including per\-task prompt pairs and scoring weights, are in Appendix[C\.3](https://arxiv.org/html/2605.24140#A3.SS3)\.

##### Out\-of\-domain Transfer\.

Because the LoRA adapter is trained task\-agnostically on augmented data, a single group\-level adapter transfers to the out\-of\-domain datasets when paired with a dataset\-specific head\. Table[3](https://arxiv.org/html/2605.24140#S4.T3)reports this regime: the group\-level adapter trained on each group’s augmented lead in\-domain distribution \(Game of 24 for Group A, Rule\-chaining for Group B\) is paired with a small per\-dataset projection head and evaluated on every dataset in its group, so for each group’s lead in\-domain dataset the cell measures whether task\-agnostic training preserves in\-domain accuracy, and for the remaining datasets the cell measures genuine out\-of\-domain transfer\. MATH stresses the Monte\-Carlo head variant: even without an enumerable tree, the geometric signal transfers and HyperGuide leads on two of three backbones\.

Table 2:In\-domain test results \(%\)\.Boldmarks the best result andunderlinemarks the second best in each column within each backbone block \(starred entries are excluded from the ranking\)\. \*PT\-SFT is memorization on PlanBench gold \(same distribution as test\); not evidence of compositional planning\.†N\-Queens usesN=7N\{=\}7boards at training time andN=8N\{=\}8at test time\.Group AGroup BBase modelMethodGame of24N\-Queens†\(N=8N\{=\}8\)BlocksworldGraphColoringRule\-chainingProntoQAProofWriterQwen2\.5Few\-shot119\.94163536070\.4Self\-Consistency2111\.16060785874Tree of Thoughts103\.75834524169OVM154\.981\.459\.4846738PT\-SFT79\.996\*647752\.549SoftCoT2722\.2827973\.57275HyperGuide5727\.28788807577\.4GPT\-OSSFew\-shot86\.195116\.14829\.6Self\-Consistency164\.9257\.4523947Tree of Thoughts913\.63749504446OVM134\.977\.453706233PT\-SFT79\.993\*587256\.542\.4SoftCoT2118\.5796867\.56461\.4HyperGuide4224\.78381776867MistralFew\-shot811\.15749548157Self\-Consistency1511\.15349\.674\.57367\.6Tree of Thoughts613\.65514506966OVM93\.76255\.471\.57972PT\-SFT716\.197\*527281\.542\.6SoftCoT179\.9715763\.56157HyperGuide4418\.57663798987

Table 3:Out\-of\-domain transfer results \(%\)\.Boldmarks the best result andunderlinemarks the second best in each column within each backbone block \(starred entries are excluded from the ranking\)\. For each backbone, a single group\-level HyperGuide adapter is trained on the lead in\-domain distribution \(Game of 24 / Rule\-chaining\) and combined with a data\-specific projection head\.Group AGroup BBase modelMethodGame of24N\-Queens\(N=8N\{=\}8\)BlocksworldGraphColoringRule\-chainingProntoQAProofWriterMATHTransfercostQwen2\.5Few\-shot119\.94163536070\.471\.2N/ASelf\-Consistency2111\.1606078587472N/ATree of Thoughts103\.7583452416974N/AOVM154\.981\.459\.484673880\.6full retrainPT\-SFT79\.996\*647752\.54976full retrainSoftCoT2722\.2827973\.5727559\.8full retrainHyperGuide5522\.2796477747584\.4small MLPGPT\-OSSFew\-shot86\.195116\.14829\.662\.8N/ASelf\-Consistency164\.9257\.452394764N/ATree of Thoughts913\.6374950444667N/AOVM134\.977\.45370623377\.4full retrainPT\-SFT79\.993\*587256\.542\.476\.6full retrainSoftCoT2118\.5826867\.56461\.475\.4full retrainHyperGuide3919\.8806374595979\.8small MLPMistralFew\-shot811\.1574954815743\.4N/ASelf\-Consistency1511\.15349\.674\.57367\.646N/ATree of Thoughts613\.6551450696647N/AOVM93\.76255\.471\.5797253full retrainPT\-SFT716\.197\*527281\.542\.649\.8full retrainSoftCoT179\.9715763\.5615745\.8full retrainHyperGuide3816\.1635770846951small MLP

### 4\.3Ablation Study

Table[4](https://arxiv.org/html/2605.24140#S4.T4)isolates the contribution of each component of our pipeline on Qwen2\.5\-14B\-Instruct, with one column per dataset so that per\-dataset patterns \(in particular, the dependence of each component on task depth\) are visible rather than averaged out\. The ablations follow the same per\-dataset in\-domain regime as Table[2](https://arxiv.org/html/2605.24140#S4.T2): a separate adapter is trained on each dataset’s own training set and evaluated on its test set\. We consider the following ablations:

- •w/o hyperbolic \(Euclidean head\): the Poincaré ball𝔻cn\\mathbb\{D\}^\{n\}\_\{c\}is replaced byℝn\\mathbb\{R\}^\{n\}with identical head capacity and the same ranking loss\.
- •w/oℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}: Stage 1 is trained with the radial ranking lossℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}alone, removing the angular supervision and isolating the contribution of tree\-metric preservation\.
- •w/o value signal \(DAgger only\): the up\-projector is removed and no geometric token is injected, so the adapter is trained by pure DAgger on oracle\-labelled states\.
- •w/o DAgger \(offline SFT\): rollouts are replaced by on\-tree states sampled from the oracle, so the adapter is trained on a fixed distribution rather than its own rollout distribution\.
- •Head dimensionnn: ablations atn∈\{32,64,256\}n\\in\\\{32,64,256\\\}, characterising how embedding capacity interacts with the geometric prior\.

Table 4:Ablation study results on Qwen2\.5 \(%\)\.Boldmarks the best result\.Group AGroup BVariantGame of24N\-Queens\(N=8N\{=\}8\)BlocksworldGraphColoringRule\-chainingProntoQAProofWriterHyperGuide \(full\)5727\.28788807577\.4w/o hyperbolic \(Euclidean\)4316\.175726965\.570w/oℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}5224\.7848677\.57175\.5w/o value signal \(DAgger only\)47218080716869\.5w/o DAgger \(offline SFT\)164\.95035\.5393558n=32n=3249218081777271n=64n=645324\.7848679\.57675\.5n=256n=2565525\.985\.286\.279\.875\.576\.6

The most pronounced drop occurs when DAgger is replaced by offline SFT \(row 5\), with accuracy falling by more than half on most tasks\. This collapse is specific to the injected\-signal architecture: the adapter must learn to interpret the hyperbolic token at the reasoning states it will actually encounter during generation, but offline SFT exposes it only to states along gold traces\. At inference, the model’s own errors lead it into dead\-end states that were absent from training\. Removing the geometric channel while retaining DAgger\-style training \(row 4\) yields a smaller but consistent drop, isolating the contribution of the hyperbolic signal from that of the training regime\. The remaining rows decompose signal quality: replacing hyperbolic with Euclidean geometry \(row 2\) removes the exponential volume structure that matches the asymmetry between scarce solution\-bearing states and abundant dead ends, while droppingℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}\(row 3\) preserves radial ordering but collapses angular separation, reducing distinguishability among states at similar depth\. Sensitivity to the continuous loss hyperparameters is reported in Appendix[B\.5](https://arxiv.org/html/2605.24140#A2.SS5); accuracy is stable across a wide range of values for each, confirming that the gains are not an artefact of precise tuning\.

### 4\.4Inference Efficiency

![Refer to caption](https://arxiv.org/html/2605.24140v1/gc_efficiency.png)Figure 2:Accuracy versus inference cost\.Figure[2](https://arxiv.org/html/2605.24140#S4.F2)plots Graph Coloring accuracy against the average number of generated tokens per problem\. HyperGuide sits at the upper\-left frontier: at each step boundary the projection head and up\-projector add only twoO​\(1\)O\(1\)MLP evaluations over the cached backbone hidden state, whose combined FLOPs are negligible relative to a single LLM forward pass\. Neither head emits tokens so HyperGuide adds no per\-step token overhead; in practice the geometric signal further steers the model toward shorter, more direct reasoning paths, so HyperGuide’s total per\-problem token count is the lowest among all methods\. By contrast, Tree of Thoughts decodes a separate thought or value verdict at each ofb​\(1\+v\)​D=60b\(1\+v\)D=60search nodes \(beam widthb=5b\{=\}5, depthD=3D\{=\}3,v=3v\{=\}3value votes per candidate\), inflating the per\-problem token count by an order of magnitude through branching and backtracking\.

### 4\.5Depth Scaling Analysis

A central prediction of HyperGuide is that the hyperbolic value signal yields larger gains as reasoning chains grow longer\. The intuition is geometric: the Poincar’e ball’s exponential volume expansion matches the exponential branching of deep search trees, so distant states stay geometrically separated where a Euclidean embedding would lose them to crowding\. We test this by stratifying accuracy by a per\-instance depth measure on two datasets\.

##### Rule\-chaining\.

We stratify the Rule\-chaining test set by gold chain lengthnsteps∈\{2,3,4\}n\_\{\\mathrm\{steps\}\}\\in\\\{2,3,4\\\}, the number of Horn\-clause rule applications required to derive the target fact\. Figure[3](https://arxiv.org/html/2605.24140#S4.F3)plots accuracy at eachnstepsn\_\{\\mathrm\{steps\}\}bin\. The picture reverses as depth grows\. Such trend is consistent with the hypothesis that the hyperbolic value signal becomes increasingly useful as the search tree deepens\.

##### ProofWriter\.

We stratify ProofWriter accuracy by question depthQDep\\mathrm\{QDep\}, the number of inference\-rule applications required to derive the target conclusion\. Figure[3](https://arxiv.org/html/2605.24140#S4.F3)shows their accuracy\. The same crossover pattern emerges\. Starting fromQDep=2\\mathrm\{QDep\}\{=\}2, HyperGuide takes the lead over all baselines\. SinceQDep\\mathrm\{QDep\}measures inferential depth differently fromnstepsn\_\{\\mathrm\{steps\}\}, the recurrence of the same crossover under both metrics points to depth itself, rather than any dataset\-specific artifact\.

Across both datasets, the results support a consistent conclusion: the hyperbolic value signal is most valuable precisely where baselines struggle most: deeper chains that demand sustained lookahead\.

![Refer to caption](https://arxiv.org/html/2605.24140v1/rulechain_nsteps.png)\(a\)Rule\-chaining accuracy at chain length\{2,3,4\}\\\{2,3,4\\\}\.
![Refer to caption](https://arxiv.org/html/2605.24140v1/proofwriter_qdep.png)\(b\)ProofWriter accuracy atQDep∈\{0,1,2,3\}\\mathrm\{QDep\}\\in\\\{0,1,2,3\\\}\.

Figure 3:Depth\-scaling results on Group B\. Left: Rule\-chaining stratified by gold chain length\. Right: ProofWriter stratified by question depth\.

### 4\.6Signal Mechanism Analysis

In this section we look inside the embedding itself and ask whether its two intended axes, i\.e\. radial \(distance\-to\-origin\) and structural \(pairwise geodesic distance\), actually carry the quantities the training objectives target\.

##### Distributional shift tracks hyperbolic distance\.

Using the 100 step\-boundary states Game of 24 test set, we record the full next\-token logit distribution under\+signal\(the spliced virtual tokengψ​\(𝐳t\)g\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)is included in the residual stream\) and–signal\(the splice position is retained but filled withgψ​\(𝐳¯\)g\_\{\\psi\}\(\\bar\{\\mathbf\{z\}\}\), where𝐳¯\\bar\{\\mathbf\{z\}\}is the mean embedding over the training\-state population, giving an in\-distribution but state\-uninformative input\), and compute the KL divergenceDKL​\(p\+∥p−\)D\_\{\\mathrm\{KL\}\}\(p\_\{\+\}\\\|p\_\{\-\}\)between the two\. Figure[4](https://arxiv.org/html/2605.24140#S4.F4)plots this divergence against the hyperbolic distance\-to\-origind𝔻​\(𝟎,𝐳\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{0\},\\mathbf\{z\}\)of each state’s embedding\. Across step\-boundary states, KL grows withd𝔻d\_\{\\mathbb\{D\}\}, indicating that the adapter modulates its predictions most strongly at states the radial signal identifies as distant from a solution, consistent with the radial axis carrying its strongest distinguishing information precisely where the backbone’s default continuation is most likely to be wrong and must be overridden\.

##### Geodesic distance tracks tree distance\.

Whileℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}shapes the radial axis, the structural axis is installed byℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}, which enforces, for each anchorsis\_\{i\}, that the ordering ofd𝔻​\(𝐳i,⋅\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\cdot\)matches the ordering ofd𝒯​\(si,⋅\)d\_\{\\mathcal\{T\}\}\(s\_\{i\},\\cdot\)\. To test this directly, we sample anchor states from Game\-of\-24 reasoning trees and compute the per\-anchor Spearmanρi\\rho\_\{i\}between tree distancesd𝒯​\(si,⋅\)d\_\{\\mathcal\{T\}\}\(s\_\{i\},\\cdot\)and Poincaré geodesic distancesd𝔻​\(𝐳i,⋅\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\cdot\)\. Figure[4](https://arxiv.org/html/2605.24140#S4.F4)reports the distribution ofρi\\rho\_\{i\}for the full model and the w/oℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}ablation\. The full HyperGuide head reaches a medianρ\\rhoof\+0\.84\+0\.84, compared to\+0\.41\+0\.41without the metric loss, confirming thatℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}installs the per\-anchor structural ordering it is designed to enforce\. Combined with Figure[4](https://arxiv.org/html/2605.24140#S4.F4), this confirms the embedding encodes both*how far*a state is from a solution \(radial\) and*where in the tree*it lies \(structural\), giving the adapter a geometrically faithful summary of the local search landscape\.

![Refer to caption](https://arxiv.org/html/2605.24140v1/target_kl_vs_d.png)\(a\)KL divergence between\+signaland–signalnext\-token distributions as a function of hyperbolic distance\-to\-origin\. Each point is one step\-boundary state from the Game\-of\-24 test set\.
![Refer to caption](https://arxiv.org/html/2605.24140v1/target_per_anchor_rho.png)\(b\)Distribution of per\-anchor Spearmanρi\\rho\_\{i\}between tree distancesd𝒯​\(si,⋅\)d\_\{\\mathcal\{T\}\}\(s\_\{i\},\\cdot\)and Poincaré geodesic distancesd𝔻​\(𝐳i,⋅\)d\_\{\\mathbb\{D\}\}\(\\mathbf\{z\}\_\{i\},\\cdot\), comparing the full model against the w/oℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}ablation\. Each point is one anchor state; horizontal bars mark the median\.

Figure 4:Signal mechanism analysis\. Left: KL divergence vs\. hyperbolic distance\-to\-origin \(radial axis\)\. Right: per\-anchor Spearmanρ\\rhobetween tree distance and geodesic distance, validating the metric loss \(structural axis\)\.

## 5Conclusion

We presented HyperGuide, a method that injects a hyperbolic geometric signal into LLM reasoning to provide step\-level guidance without runtime search\. The approach is grounded in a structural correspondence between combinatorial reasoning trees, in which solution\-bearing states are scarce and dead ends abundant, and the Poincaré ball, whose compact center and exponentially expanding boundary accommodate this asymmetry\. Across arithmetic, planning, and deductive\-logic benchmarks, HyperGuide delivers consistent accuracy gains that grow with reasoning depth, while inference cost remains comparable to standard single\-pass decoding\. More broadly, our results indicate that solution\-space geometry is an underexploited inductive bias for LLM reasoning\.

## References

- Thinking Fast and Slow with Deep Learning and Tree Search\.arXiv\.Note:arXiv:1705\.08439 \[cs\]External Links:[Link](http://arxiv.org/abs/1705.08439),[Document](https://dx.doi.org/10.48550/arXiv.1705.08439)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\)Graph of Thoughts: Solving Elaborate Problems with Large Language Models\.Proceedings of the AAAI Conference on Artificial Intelligence38\(16\),pp\. 17682–17690\.Note:arXiv:2308\.09687 \[cs\]External Links:ISSN 2374\-3468, 2159\-5399,[Link](http://arxiv.org/abs/2308.09687),[Document](https://dx.doi.org/10.1609/aaai.v38i16.29720)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language Models are Few\-Shot Learners\.arXiv\.Note:arXiv:2005\.14165 \[cs\]External Links:[Link](http://arxiv.org/abs/2005.14165),[Document](https://dx.doi.org/10.48550/arXiv.2005.14165)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- I\. Chami, R\. Ying, C\. Ré, and J\. Leskovec \(2019\)Hyperbolic Graph Convolutional Neural Networks\.arXiv\.Note:arXiv:1910\.12933 \[cs\]External Links:[Link](http://arxiv.org/abs/1910.12933),[Document](https://dx.doi.org/10.48550/arXiv.1910.12933)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- W\. Chen, X\. Han, Y\. Lin, H\. Zhao, Z\. Liu, P\. Li, M\. Sun, and J\. Zhou \(2022\)Fully Hyperbolic Neural Networks\.arXiv\.Note:arXiv:2105\.14686 \[cs\]External Links:[Link](http://arxiv.org/abs/2105.14686),[Document](https://dx.doi.org/10.48550/arXiv.2105.14686)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, S\. Ye, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Zhao, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning\.Nature645\(8081\),pp\. 633–638\.Note:arXiv:2501\.12948 \[cs\]External Links:ISSN 0028\-0836, 1476\-4687,[Link](http://arxiv.org/abs/2501.12948),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- O\. Ganea, G\. Bécigneul, and T\. Hofmann \(2018\)Hyperbolic Entailment Cones for Learning Hierarchical Embeddings\.arXiv\.Note:arXiv:1804\.01882 \[cs\]External Links:[Link](http://arxiv.org/abs/1804.01882),[Document](https://dx.doi.org/10.48550/arXiv.1804.01882)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- S\. Hao, Y\. Gu, H\. Ma, J\. Hong, Z\. Wang, D\. Wang, and Z\. Hu \(2023\)Reasoning with Language Model is Planning with World Model\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 8154–8173\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.507/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.507)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.24140#S3.SS1.SSS0.Px1.p1.12)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. E\. Weston, and Y\. Tian \(2025\)Training Large Language Models to Reason in a Continuous Latent Space\.\(en\)\.External Links:[Link](https://openreview.net/forum?id=Itxz7S4Ip3#discussion)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- A\. Heyman and J\. Zylberberg \(2025\)Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring\.arXiv\.Note:arXiv:2502\.07087 \[cs\]External Links:[Link](http://arxiv.org/abs/2502.07087),[Document](https://dx.doi.org/10.48550/arXiv.2502.07087)Cited by:[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Jiang, F\. Wang, J\. Shen, S\. Kim, and S\. Kim \(2026\)A Survey on Large Language Models for Code Generation\.ACM Transactions on Software Engineering and Methodology35\(2\),pp\. 1–72\(en\)\.External Links:ISSN 1049\-331X, 1557\-7392,[Link](https://dl.acm.org/doi/10.1145/3747588),[Document](https://dx.doi.org/10.1145/3747588)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p1.1)\.
- V\. Khrulkov, L\. Mirvakhabova, E\. Ustinova, I\. Oseledets, and V\. Lempitsky \(2020\)Hyperbolic Image Embeddings\.In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Seattle, WA, USA,pp\. 6417–6427\(en\)\.External Links:ISBN 978\-1\-7281\-7168\-5,[Link](https://ieeexplore.ieee.org/document/9156432/),[Document](https://dx.doi.org/10.1109/CVPR42600.2020.00645)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s Verify Step by Step\.arXiv\.Note:arXiv:2305\.20050 \[cs\]External Links:[Link](http://arxiv.org/abs/2305.20050),[Document](https://dx.doi.org/10.48550/arXiv.2305.20050)Cited by:[§B\.1\.2](https://arxiv.org/html/2605.24140#A2.SS1.SSS2.Px8.p1.1),[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Liu, A\. Cohen, R\. Pasunuru, Y\. Choi, H\. Hajishirzi, and A\. Celikyilmaz \(2024\)Don’t throw away your value model\! Generating more preferable text with Value\-Guided Monte\-Carlo Tree Search decoding\.arXiv\.Note:arXiv:2309\.15028 \[cs\]External Links:[Link](http://arxiv.org/abs/2309.15028),[Document](https://dx.doi.org/10.48550/arXiv.2309.15028)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- Mistral AI \(2025\)Mistralai/Mistral\-Small\-3\.2\-24B\-Instruct\-2506 · Hugging Face\.External Links:[Link](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506)Cited by:[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px2.p1.1)\.
- M\. Nickel and D\. Kiela \(2017\)Poincaré Embeddings for Learning Hierarchical Representations\.arXiv\.Note:arXiv:1705\.08039 \[cs\]External Links:[Link](http://arxiv.org/abs/1705.08039),[Document](https://dx.doi.org/10.48550/arXiv.1705.08039)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2605.24140#S3.SS1.SSS0.Px2.p1.8)\.
- M\. Nickel and D\. Kiela \(2018\)Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry\.arXiv\.Note:arXiv:1806\.03417 \[cs\]External Links:[Link](http://arxiv.org/abs/1806.03417),[Document](https://dx.doi.org/10.48550/arXiv.1806.03417)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- Octavian\-Eugen Ganea, G\. Bécigneul, and T\. Hofmann \(2018\)Hyperbolic Neural Networks\.arXiv\.Note:arXiv:1805\.09112 \[cs\]External Links:[Link](http://arxiv.org/abs/1805.09112),[Document](https://dx.doi.org/10.48550/arXiv.1805.09112)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2605.24140#S3.SS1.SSS0.Px2.p1.8)\.
- OpenAI, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b Model Card\.arXiv\.Note:arXiv:2508\.10925 \[cs\]External Links:[Link](http://arxiv.org/abs/2508.10925),[Document](https://dx.doi.org/10.48550/arXiv.2508.10925)Cited by:[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px2.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 Technical Report\.arXiv\.Note:arXiv:2412\.15115 \[cs\]External Links:[Link](http://arxiv.org/abs/2412.15115),[Document](https://dx.doi.org/10.48550/arXiv.2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Raj \(2026\)HYPERBOLIC GEOMETRY OF REASONING: PROBING LLM HIDDEN STATES\.\(en\)\.Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- S\. Ross, G\. J\. Gordon, and J\. A\. Bagnell \(2011\)A Reduction of Imitation Learning and Structured Prediction to No\-Regret Online Learning\.arXiv\.Note:arXiv:1011\.0686 \[cs\]External Links:[Link](http://arxiv.org/abs/1011.0686),[Document](https://dx.doi.org/10.48550/arXiv.1011.0686)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2605.24140#S3.SS2.SSS0.Px3.p2.7)\.
- A\. A\. Rusu, S\. G\. Colmenarejo, C\. Gulcehre, G\. Desjardins, J\. Kirkpatrick, R\. Pascanu, V\. Mnih, K\. Kavukcuoglu, and R\. Hadsell \(2016\)Policy Distillation\.arXiv\.Note:arXiv:1511\.06295 \[cs\]External Links:[Link](http://arxiv.org/abs/1511.06295),[Document](https://dx.doi.org/10.48550/arXiv.1511.06295)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- C\. D\. Sa, A\. Gu, C\. Ré, and F\. Sala \(2018\)Representation Tradeoffs for Hyperbolic Embeddings\.arXiv\.Note:arXiv:1804\.03329 \[cs\]External Links:[Link](http://arxiv.org/abs/1804.03329),[Document](https://dx.doi.org/10.48550/arXiv.1804.03329)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2605.24140#S3.SS1.SSS0.Px2.p1.8)\.
- A\. Saparov and H\. He \(2023\)Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain\-of\-Thought\.arXiv\.Note:arXiv:2210\.01240 \[cs\]External Links:[Link](http://arxiv.org/abs/2210.01240),[Document](https://dx.doi.org/10.48550/arXiv.2210.01240)Cited by:[§B\.1\.2](https://arxiv.org/html/2605.24140#A2.SS1.SSS2.Px5.p1.5),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. Lillicrap, and D\. Silver \(2020\)Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model\.Nature588\(7839\),pp\. 604–609\.Note:arXiv:1911\.08265 \[cs\]External Links:ISSN 0028\-0836, 1476\-4687,[Link](http://arxiv.org/abs/1911.08265),[Document](https://dx.doi.org/10.1038/s41586-020-03051-4)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He \(2025\)CODI: Compressing Chain\-of\-Thought into Continuous Space via Self\-Distillation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 677–693\.External Links:ISBN 979\-8\-89176\-332\-6,[Link](https://aclanthology.org/2025.emnlp-main.36/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.36)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- R\. Shimizu, Y\. Mukuta, and T\. Harada \(2021\)Hyperbolic Neural Networks\+\+\.arXiv\.Note:arXiv:2006\.08210 \[cs\]External Links:[Link](http://arxiv.org/abs/2006.08210),[Document](https://dx.doi.org/10.48550/arXiv.2006.08210)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. van den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot, S\. Dieleman, D\. Grewe, J\. Nham, N\. Kalchbrenner, I\. Sutskever, T\. Lillicrap, M\. Leach, K\. Kavukcuoglu, T\. Graepel, and D\. Hassabis \(2016\)Mastering the game of Go with deep neural networks and tree search\.Nature529\(7587\),pp\. 484–489\(en\)\.External Links:ISSN 1476\-4687,[Link](https://www.nature.com/articles/nature16961),[Document](https://dx.doi.org/10.1038/nature16961)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, T\. Lillicrap, K\. Simonyan, and D\. Hassabis \(2017a\)Mastering Chess and Shogi by Self\-Play with a General Reinforcement Learning Algorithm\.arXiv\.Note:arXiv:1712\.01815 \[cs\]External Links:[Link](http://arxiv.org/abs/1712.01815),[Document](https://dx.doi.org/10.48550/arXiv.1712.01815)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton, Y\. Chen, T\. Lillicrap, F\. Hui, L\. Sifre, G\. van den Driessche, T\. Graepel, and D\. Hassabis \(2017b\)Mastering the game of Go without human knowledge\.Nature550\(7676\),pp\. 354–359\(en\)\.External Links:ISSN 1476\-4687,[Link](https://www.nature.com/articles/nature24270),[Document](https://dx.doi.org/10.1038/nature24270)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- O\. Tafjord, B\. Dalvi, and P\. Clark \(2021\)ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 3621–3634\.External Links:[Link](https://aclanthology.org/2021.findings-acl.317/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.317)Cited by:[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Tian, B\. Peng, L\. Song, L\. Jin, D\. Yu, H\. Mi, and D\. Yu \(2024\)Toward Self\-Improvement of LLMs via Imagination, Searching, and Criticizing\.arXiv\.Note:arXiv:2404\.12253 \[cs\]External Links:[Link](http://arxiv.org/abs/2404.12253),[Document](https://dx.doi.org/10.48550/arXiv.2404.12253)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- J\. Uesato, N\. Kushman, R\. Kumar, F\. Song, N\. Siegel, L\. Wang, A\. Creswell, G\. Irving, and I\. Higgins \(2022\)Solving math word problems with process\- and outcome\-based feedback\.arXiv\.Note:arXiv:2211\.14275 \[cs\]External Links:[Link](http://arxiv.org/abs/2211.14275),[Document](https://dx.doi.org/10.48550/arXiv.2211.14275)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- K\. Valmeekam, M\. Marquez, A\. Olmo, S\. Sreedharan, and S\. Kambhampati \(2023\)PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change\.arXiv\.Note:arXiv:2206\.10498 \[cs\]External Links:[Link](http://arxiv.org/abs/2206.10498),[Document](https://dx.doi.org/10.48550/arXiv.2206.10498)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. X\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024a\)Math\-Shepherd: Verify and Reinforce LLMs Step\-by\-step without Human Annotations\.arXiv\.Note:arXiv:2312\.08935 \[cs\]External Links:[Link](http://arxiv.org/abs/2312.08935),[Document](https://dx.doi.org/10.48550/arXiv.2312.08935)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- X\. Wang, L\. Caccia, O\. Ostapenko, X\. Yuan, W\. Y\. Wang, and A\. Sordoni \(2024b\)Guiding Language Model Reasoning with Planning Tokens\.arXiv\.Note:arXiv:2310\.05707 \[cs\]External Links:[Link](http://arxiv.org/abs/2310.05707),[Document](https://dx.doi.org/10.48550/arXiv.2310.05707)Cited by:[§C\.3](https://arxiv.org/html/2605.24140#A3.SS3.SSS0.Px4.p1.19),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-Consistency Improves Chain of Thought Reasoning in Language Models\.arXiv\.Note:arXiv:2203\.11171 \[cs\]External Links:[Link](http://arxiv.org/abs/2203.11171),[Document](https://dx.doi.org/10.48550/arXiv.2203.11171)Cited by:[§C\.3](https://arxiv.org/html/2605.24140#A3.SS3.SSS0.Px2.p1.3),[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-Thought Prompting Elicits Reasoning in Large Language Models\.arXiv\.Note:arXiv:2201\.11903 \[cs\]External Links:[Link](http://arxiv.org/abs/2201.11903),[Document](https://dx.doi.org/10.48550/arXiv.2201.11903)Cited by:[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.
- Y\. Xu, X\. Guo, Z\. Zeng, and C\. Miao \(2025\)SoftCoT: Soft Chain\-of\-Thought for Efficient Reasoning with LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 23336–23351\.External Links:ISBN 979\-8\-89176\-251\-0,[Link](https://aclanthology.org/2025.acl-long.1137/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1137)Cited by:[§C\.3](https://arxiv.org/html/2605.24140#A3.SS3.SSS0.Px7.p1.6),[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- M\. Yang, R\. S\. B\. B, A\. Feng, B\. Xiong, J\. Liu, I\. King, and R\. Ying \(2026\)Hyperbolic Fine\-Tuning for Large Language Models\.arXiv\.Note:arXiv:2410\.04010 \[cs\]External Links:[Link](http://arxiv.org/abs/2410.04010),[Document](https://dx.doi.org/10.48550/arXiv.2410.04010)Cited by:[§2\.3](https://arxiv.org/html/2605.24140#S2.SS3.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of Thoughts: Deliberate Problem Solving with Large Language Models\.\(en\)\.Cited by:[§C\.3](https://arxiv.org/html/2605.24140#A3.SS3.SSS0.Px3.p1.4),[§1](https://arxiv.org/html/2605.24140#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2605.24140#S3.SS1.SSS0.Px1.p1.12),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- F\. Yu, A\. Gao, and B\. Wang \(2024\)OVM, Outcome\-supervised Value Models for Planning in Mathematical Reasoning\.InFindings of the Association for Computational Linguistics: NAACL 2024,K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 858–875\.External Links:[Link](https://aclanthology.org/2024.findings-naacl.55/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-naacl.55)Cited by:[§C\.3](https://arxiv.org/html/2605.24140#A3.SS3.SSS0.Px6.p1.3),[§4\.1](https://arxiv.org/html/2605.24140#S4.SS1.SSS0.Px3.p1.1)\.
- D\. Zhang, S\. Zhoubian, Z\. Hu, Y\. Yue, Y\. Dong, and J\. Tang \(2024\)ReST\-MCTS\*: LLM Self\-Training via Process Reward Guided Tree Search\.arXiv\.Note:arXiv:2406\.03816 \[cs\]External Links:[Link](http://arxiv.org/abs/2406.03816),[Document](https://dx.doi.org/10.48550/arXiv.2406.03816)Cited by:[§2\.2](https://arxiv.org/html/2605.24140#S2.SS2.p1.1)\.
- D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le, and E\. Chi \(2023\)Least\-to\-Most Prompting Enables Complex Reasoning in Large Language Models\.arXiv\.Note:arXiv:2205\.10625 \[cs\]External Links:[Link](http://arxiv.org/abs/2205.10625),[Document](https://dx.doi.org/10.48550/arXiv.2205.10625)Cited by:[§2\.1](https://arxiv.org/html/2605.24140#S2.SS1.p1.1)\.

## Appendix ATask Formalism

This appendix gives the full formal specification of the multi\-step reasoning task referenced in Section[3\.1](https://arxiv.org/html/2605.24140#S3.SS1)\.

A problem instance is a tuple\(s0,g,𝒜,δ\)\(s\_\{0\},g,\\mathcal\{A\},\\delta\), wheres0s\_\{0\}is the initial state \(e\.g\., the number pool and target in Game of 24, or the premise set and goal fact in rule\-chaining\),ggis the goal condition,𝒜​\(s\)\\mathcal\{A\}\(s\)is the set of admissible single\-step operations in statess, andδ​\(s,a\)\\delta\(s,a\)is a deterministic transition that applies actionaato statess\. A reasoning trace of lengthTTis a sequenceτ=\(s0,a1,s1,…,aT,sT\)\\tau=\(s\_\{0\},a\_\{1\},s\_\{1\},\\dots,a\_\{T\},s\_\{T\}\)withst=δ​\(st−1,at\)s\_\{t\}=\\delta\(s\_\{t\-1\},a\_\{t\}\)andat∈𝒜​\(st−1\)a\_\{t\}\\in\\mathcal\{A\}\(s\_\{t\-1\}\), and is*successful*ifsTs\_\{T\}satisfiesgg\.

##### Search tree and distance\-to\-solution\.

The set of all states reachable froms0s\_\{0\}under\(𝒜,δ\)\(\\mathcal\{A\},\\delta\)forms the*search tree*𝒯s0,g\\mathcal\{T\}\_\{s\_\{0\},g\}; a states∈𝒯s0,gs\\in\\mathcal\{T\}\_\{s\_\{0\},g\}lies on a*solution path*iff some successful trace passes through it\. We define the*distance\-to\-solution*ofssas

d​\(s\)=min⁡\{T−ts:τ​is successful and​s=sts​in​τ\},d\(s\)\\;=\\;\\min\\big\\\{\\,T\-t\_\{s\}\\,:\\,\\tau\\text\{ is successful and \}s=s\_\{t\_\{s\}\}\\text\{ in \}\\tau\\,\\big\\\},\(7\)withd​\(s\)=∞d\(s\)=\\inftywhen no successful trace passes throughss; equivalently,d​\(s\)d\(s\)is the minimum BFS edge distance fromssto a successful leaf in𝒯s0,g\\mathcal\{T\}\_\{s\_\{0\},g\}\.

##### Policy and evaluation\.

A policy is a distributionπ​\(a∣s\)\\pi\(a\\mid s\)over𝒜​\(s\)\\mathcal\{A\}\(s\); at inference,π\\piis unrolled froms0s\_\{0\}until eitherggis satisfied or a fixed step budget is exhausted\. We report two quantities throughout:*accuracy*, the probability that a rollout ofπ\\piis successful, and*inference cost*, the number of LLM forward passes consumed per problem\.

## Appendix BAdditional Experimental Settings

### B\.1Dataset Construction

We evaluate on a suite ofeightreasoning benchmarks that span arithmetic, classical planning, constraint satisfaction, multi\-hop logical inference, and relational reasoning\. Each benchmark is provided as three disjoint JSONL splits \(train / validation / test\) with sizes summarized in Table[5](https://arxiv.org/html/2605.24140#A2.T5)\. To prevent leakage between the policy \(π\\pi\), the SFT prefix tuner, the optional outcome value model \(OVM\), and the hyperbolic distortion head, all four components are trained on the*train*split only; the*validation*split is used for hyper\-parameter selection and early stopping; the*test*split is held out and used exclusively for the numbers reported in Section[4](https://arxiv.org/html/2605.24140#S4)\.

Table 5:Split sizes for the eight reasoning datasets used in this work\. Splits are constructed by random partitioning of independently sampled instances; for ProntoQA, ProofWriter and MATH we additionally enforce a disjoint surface\-form \(entity / predicate / theory\) partition between train and test\. A dash \(–\) indicates that no validation split is released for that dataset: BW and GC are small benchmarks for which we reuse the hyperparameters selected on the larger datasets \(RC, PQ, PW, MT\) instead of holding out a separate validation set\.DatasetTrainValTestReasoning skillGame\-of\-24 \(G24\)7 634300100arithmetic searchBlocksworld \(BW\)600–350STRIPS planningGraph Coloring \(GC\)1 000–500CSP / backtrackingN\-Queens \(NQ\)2941581combinatorial searchProntoQA \(PQ\)3 000500800taxonomic entailmentProofWriter \(PW\)2 000200500open\-world inferenceRuleChain \(RC\)6 000600600forward chainingMATH \(MT\)7 500–500competition math#### B\.1\.1Common construction recipe

Every instancex∈𝒟x\\in\\mathcal\{D\}is generated together with a ground\-truth solution tracey⋆=\(s0,a1,s1,…,aT,sT\)y^\{\\star\}=\(s\_\{0\},a\_\{1\},s\_\{1\},\\dots,a\_\{T\},s\_\{T\}\)produced by a problem\-specific oracle solver\. Each instance is stored as a JSON record with at least the fields:prompt\(natural\-language statement\),init\_state\_text\(the initial state used by the planner\),answer\_label\(the canonical step\-by\-step solution\), plus the structured fields needed by the oracle \(graph, rules, permutation, etc\.\)\. Splits are produced by: \(i\) sampling instances from a parametric problem distribution \(controlled by difficulty knobs listed below\), \(ii\) discarding unsatisfiable or trivially solvable instances via the oracle, and \(iii\) partitioning the surviving instances uniformly at random into train / val / test, with a final de\-duplication pass on thepromptfield to enforce zero string\-level overlap across splits\.

#### B\.1\.2Per\-dataset details and examples

##### Game\-of\-24 \(G24\)\.

Each instance is a multiset of four integers in\[1,13\]\[1,13\]; the goal is to combine them with\+,−,×,÷\+,\-,\\times,\\div\(and arbitrary parenthesization\) to reach2424\. We retain only*solvable*multisets and label each with a single canonical 3\-step trace produced by an exhaustive oracle\. Difficulty is controlled by the number of distinct admissible solutions \(≤6\\leq 6in our pool, to keep the search non\-trivial\)\.

G24· Game\-of\-24, arithmetic searchInput\.1 4 4 12Goal\.Combine the four numbers with\+,−,×,÷\+,\-,\\times,\\divand parentheses to reach2424\.Trace\.Step 1\.1−4=−31\-4=\-3\. Remaining:\-3 4 12Step 2\.−3×4=−12\-3\\times 4=\-12\. Remaining:\-12 12Step 3\.12−\(−12\)=2412\-\(\-12\)=24\.Answer\.2424\.

##### Blocksworld \(BW\)\.

Standard STRIPS planning over44–55blocks with the actionspick\-up,put\-down,stack,unstack\. Initial and goal configurations are sampled uniformly over reachable states; instances whose optimal plan length is≤2\\leq 2are filtered out to keep planning depth≥3\\geq 3\.

BW· Blocksworld, STRIPS planningInitial state\.red, orange, yellow are clear; the hand is empty; yellow is on top of blue; red, blue, orange are on the table\.Goal\.red on yellow, blue on orange, yellow on blue\.Plan\.1\.unstack yellow from blue2\.stack yellow on red3\.pick up blue4\.stack blue on orange5\.unstack yellow from red6\.stack yellow on blue7\.pick up red8\.stack red on yellow

##### Graph Coloring \(GC\)\.

We sample Erdős–Rényi graphsG​\(n,p\)G\(n,p\)withn∈\{7,8\}n\\in\\\{7,8\\\}and edge densityp∈\{0\.4,0\.5,0\.6\}p\\in\\\{0\.4,0\.5,0\.6\\\}\. Each instance is verified satisfiable by a DPLL oracle, which also produces the labelled coloring used as the gold trace\. Colors are drawn from\{R,G,B\}\\\{R,G,B\\\}\.

GC· Graph Coloring, CSP / backtrackingVertices\.V0,…,V5V\_\{0\},\\dots,V\_\{5\}\.Edges\.\(V0,V1\),\(V0,V4\),\(V0,V5\),\(V1,V2\),\(V1,V4\),\(V\_\{0\},V\_\{1\}\),\(V\_\{0\},V\_\{4\}\),\(V\_\{0\},V\_\{5\}\),\(V\_\{1\},V\_\{2\}\),\(V\_\{1\},V\_\{4\}\),\(V2,V3\),\(V2,V5\),\(V3,V4\),\(V3,V5\),\(V4,V5\)\(V\_\{2\},V\_\{3\}\),\(V\_\{2\},V\_\{5\}\),\(V\_\{3\},V\_\{4\}\),\(V\_\{3\},V\_\{5\}\),\(V\_\{4\},V\_\{5\}\)\.Coloring\.V0=R,V1=G,V2=B,V3=R,V4=B,V5=GV\_\{0\}=R,\\;V\_\{1\}=G,\\;V\_\{2\}=B,\\;V\_\{3\}=R,\\;V\_\{4\}=B,\\;V\_\{5\}=G\.

##### N\-Queens \(NQ\)\.

Train and validation instances placeN=7N=7queens on a7×77\\times 7board, with a randomly chosen partial prefix of lengthk∈\{0,…,4\}k\\in\\\{0,\\dots,4\\\}that is consistent with at least one full solution\. Test instances are*out\-of\-distribution*: they placeN=8N=8queens withk=0k=0and require a full solution from scratch\. The oracle returns a single canonical extension via standard backtracking\.

NQ· N\-Queens, combinatorial searchSetting\.N=7N=7, prefix\[1,4,7,3\]\[1,4,7,3\]\(columns of queens in rows 1–4\)\.Gold extension\.\[1,4,7,3,6,2,5\]\[1,4,7,3,\\,6,\\,2,\\,5\]\.

##### ProntoQA \(PQ\)\.

Each instance is a fictional taxonomy of made\-up entity types \(*dumpus*,*rompus*, …\) with a chain of universal\-quantifier rules of lengthL∈\{3,4,5\}L\\in\\\{3,4,5\\\}, plus a binary True/False query\. We generate data with the official ProntoQA generatorSaparov and He \[[2023](https://arxiv.org/html/2605.24140#bib.bib35)\]\(asaparov/prontoqa\) using thefictionalontology under therandomordering, with5,0005\{,\}000trials, zero few\-shot examples, and chain length swept via\-\-min\-hops 3 \-\-max\-hops 5 \-\-hops\-skip 1; concretely, we invokerun\_experiment\.py \-\-model\-name json \-\-model\-size dummy \-\-ordering random \-\-num\-trials 5000 \-\-few\-shot\-examples 0 \-\-ontology fictional \-\-min\-hops 3 \-\-max\-hops 5 \-\-hops\-skip 1\. The resulting instances are partitioned uniformly at random into3,0003\{,\}000train /500500validation /800800test, with disjoint pools of fictional type\-names between splits\.

PQ· ProntoQA, taxonomic entailmentContext\.Dumpuses are transparent\. Dumpuses are impuses\. Impuses are not brown\. Every impus is a rompus\. Rompuses are floral\. Rompuses are yumpuses\. Yumpuses are happy\. Yumpuses are jompuses\. Every jompus is not temperate\. Jompuses are numpuses\. … Stella is a dumpus\.Query\.Is the following statement true or false? “Stella is not temperate\.”Answer\.True \(via the chain dumpus→\\toimpus→\\torompus→\\toyumpus→\\tojompus⇒\\Rightarrownot temperate\)\.

##### ProofWriter \(PW\)\.

Open\-world rule\-based reasoning over a small theory of entities, attributes and binary relations\. Each instance contains∼\\sim7 initial facts and∼\\sim8 rules \(some negated\)\. The oracle is a forward\-chaining engine; we annotate every instance with its*question depth*QDep∈\{0,1,2,3\}\\in\\\{0,1,2,3\\\}, and report stratified accuracy by depth\.

PW· ProofWriter, open\-world inferenceTheory\.The cat is nice\. The cat is young\. The cat likes the mouse\. The mouse is nice\. The mouse is young\. The mouse likes the cat\. The mouse visits the cat\.Rules\.\(R1\) If someone visits the cat and the cat is nice then the cat sees the mouse\. \(R3\) If someone sees the mouse and the mouse is young then they are big\. \(R7\) If someone is big and they see the mouse then the mouse sees the cat\.…Query\.Erin is round?Answer\.True \(QDep=1=1\)\.

##### RuleChain \(RC\)\.

A purely synthetic propositional benchmark that we introduce to control chain length precisely\. Each instance hasnpred=16n\_\{\\text\{pred\}\}=16propositional symbols\{p0,…,p15\}\\\{p\_\{0\},\\dots,p\_\{15\}\\\},nrules=18n\_\{\\text\{rules\}\}=18Horn rules, a small set of initial facts, and a target predicate reachable in exactlynsteps∈\{2,3,4\}n\_\{\\text\{steps\}\}\\in\\\{2,3,4\\\}forward\-chaining steps\. Distractor rules are added so that greedy or single\-hop strategies fail\.

RC· RuleChain, forward chainingRules \(excerpt\)\.ifp2p\_\{2\}thenp9p\_\{9\}; ifp9p\_\{9\}thenp14p\_\{14\}; ifp7p\_\{7\}thenp5p\_\{5\}; ifp5p\_\{5\}thenp9p\_\{9\}; ifp13p\_\{13\}andp8p\_\{8\}thenp3p\_\{3\}; …Initial facts\.\{p10,p2,p3,p6\}\\\{p\_\{10\},p\_\{2\},p\_\{3\},p\_\{6\}\\\}\.Goal\.derivep14p\_\{14\}\.Gold proof\.Step 1\.apply “ifp2p\_\{2\}thenp9p\_\{9\}”\.Step 2\.apply “ifp9p\_\{9\}thenp14p\_\{14\}”\.Answer\.p14p\_\{14\}is derived\.

##### MATH \(MT\)\.

The 500 representative competition\-level problems sampled by Lightman et al\.Lightmanet al\.\[[2023](https://arxiv.org/html/2605.24140#bib.bib179)\]from the MATH test set, spanning algebra, number theory, geometry, counting and probability, precalculus, and intermediate algebra\. Each instance is a natural\-language problem whose solution requires several lines of algebraic or combinatorial manipulation to derive a single closed\-form answer\. Unlike the other Group B datasets, MATH has no enumerable derivation tree: states are open\-ended natural\-language strings, so exact distance\-to\-solution is unavailable, and we instead train the head with the Monte\-Carlod^​\(s\)\\hat\{d\}\(s\)estimator described in Section[3\.3](https://arxiv.org/html/2605.24140#S3.SS3)\.

MT· MATH, competition\-level mathProblem\.“Solve forxx:x\+5=x−1\\sqrt\{x\+5\}=x\-1\.”Gold chain\.Square both sides:x\+5=\(x−1\)2=x2−2​x\+1x\+5=\(x\-1\)^\{2\}=x^\{2\}\-2x\+1, givingx2−3​x−4=0x^\{2\}\-3x\-4=0, so\(x−4\)​\(x\+1\)=0\(x\-4\)\(x\+1\)=0andx∈\{4,−1\}x\\in\\\{4,\-1\\\}\. Rejectx=−1x=\-1because4=2≠−2\\sqrt\{4\}=2\\neq\-2\.Answer\.x=4x=4\.

The MATH\-500 evaluation uses the public HuggingFaceH4/MATH\-500 split in full \(all500500problems spanning seven subjects and difficulty levels 1–5; no subset selection or filtering on our part\)\. Decoding is greedy with a10241024\-token budget per problem; the prompt is a fixed four\-shot Minerva\-style chain\-of\-thought template wrapped in each model’s chat template, with no per\-task tuning\. Scoring extracts the final boxed expression from the generation, normalizes it \(whitespace, fractions, units\) and compares to the reference; on a string\-match miss we fall back to a SymPy equivalence check, so accuracies are neither inflated by lenient matching nor deflated by purely syntactic differences\. Our 71\.2% on Qwen2\.5\-14B\-Instruct is roughly five points below the third\-party MATH\-500 numbers commonly reported for this model and roughly nine points below Qwen’s own MATH\-full figure of80\.080\.0, gaps consistent with the difference between the Minerva prompt and Qwen’s official prompt format and with our10241024\-token truncation on the longest level\-5 solutions\. SoftCoT training details for MATH\-500 are documented in Appendix[C\.3](https://arxiv.org/html/2605.24140#A3.SS3)\.

#### B\.1\.3Quality control

For every dataset we run two automated audits before training: \(a\) the oracle is re\-executed on every instance to verify that the storedanswer\_labelis reproducible bit\-for\-bit; \(b\) the SHA\-1 of thepromptfield is computed across all three splits to confirm that train, val and test are pairwise disjoint\. Instances that fail either check are removed and not counted in Table[5](https://arxiv.org/html/2605.24140#A2.T5)\.

### B\.2Task\-agnostic Dataset Construction

To train a single group\-level LoRA adapter that transfers across structurally related tasks \(Section[3\.3](https://arxiv.org/html/2605.24140#S3.SS3)\), we augment each group’s lead in\-domain dataset, Game of 24 for Group A and Rule\-chaining for Group B, with synthetic*\(context, goal\)*pairs harvested from the enumerated reasoning trees of its training instances\. The augmentation exploits the fact that, within a group, the reasoning\-tree motif \(state\-reduction or state\-expansion\) is shared across tasks, so any internal node of a training tree can itself be reframed as a fresh problem whose goal is any terminal value reachable from that node\.

##### Group A \(state\-reduction\)\.

For each Game\-of\-24 training instance, we enumerate the full state tree and emit one*\(context, goal\)*pair per reachable internal node, with the context being the current set of operands and operations applied so far and the goal being any terminal value reachable from that node, not necessarily 24\. This generalises the supervision from a single fixed objective to an arbitrary target value, exposing the adapter to a broader distribution of state\-reduction subproblems while keeping the underlying tree topology fixed\.

##### Group B \(state\-expansion\)\.

For each Rule\-chaining training instance, we similarly enumerate the forward\-chaining derivation tree and sample*\(context, goal\)*pairs at every reachable derived fact: the context is the set of premises plus partial derivations up to that node, and the goal is any fact downstream of that node along a valid Horn\-clause chain\. This exposes the adapter to chains of varying length and varying target facts, isolating the state\-expansion motif from any specific target\.

For both groups, the augmented training set is filtered for surface\-form overlap against the validation and test splits of every dataset in the group, so that no augmented*\(context, goal\)*pair shares a problem with a held\-out evaluation instance\. Table[6](https://arxiv.org/html/2605.24140#A2.T6)summarises the resulting augmented training set sizes alongside their lead\-dataset source\.

Table 6:Task\-agnostic augmented training sets used to train one group\-level LoRA adapter per group\. Each augmented*\(context, goal\)*pair is harvested from an internal node of the enumerated reasoning tree of a lead\-dataset training instance\.GroupLead dataset\# Source instances\# Augmented pairsAvg\. pairs / instanceA \(state\-reduction\)Game of 241,09016,46415\.1B \(state\-expansion\)Rule\-chaining6,00070,24611\.7

### B\.3Tree Statistics

Table[7](https://arxiv.org/html/2605.24140#A2.T7)reports per\-task structural statistics of the enumerated reasoning trees\. For Group A the state\-reduction motif makes exhaustive enumeration tractable, so we measure branching factor, depth and dead\-end ratio directly on the enumerated trees\. For Group B the state\-expansion motif precludes exhaustive enumeration, so the analogous statistics are estimated from the test\-set problem distributions instead\.

Table 7:Tree statistics for all seven enumerable\-tree benchmarks\. For Group A: branching factor is averaged over internal nodes, depth is the maximum solution\-path length, and dead\-end ratio is the fraction of terminal leaves that are not successful\. For Group B: branching factor is the average number of applicable rules at internal nodes, depth is the maximum gold chain length, and dead\-end ratio is the fraction of rule applications that do not lead to the target conclusion\.†\\daggerBlocksworld actions are reversible \(no absorbing dead\-end states\), so we instead report the fraction of actions that do not lie on a shortest plan to the goal\.DatasetBranching factor \(avg\)Depth \(max\)Dead\-end ratio*Group A*Game of 246\.59399\.4%N\-Queens \(N=8N\{=\}8\)1\.49877\.7%Blocksworld2\.271690\.9%†Graph Coloring1\.56832\.4%*Group B*Rule\-chaining1\.65447\.7%ProntoQA2\.65570\.9%ProofWriter3\.96374\.3%
### B\.4Hyperparameters

Table[8](https://arxiv.org/html/2605.24140#A2.T8)summarises the hyperparameters used across the two training stages and at inference\. Settings are shared across the eight benchmarks in Table[5](https://arxiv.org/html/2605.24140#A2.T5)unless explicitly noted; per\-dataset early stopping is performed on the validation split\. Values were selected by a small grid search on Game\-of\-24 and ProofWriter and held fixed for the remaining six datasets\.

Table 8:Hyperparameters used for HyperGuide training and inference\.ArchitectureBackboneQwen2\.5 \(frozen except LoRA\)LoRA rankrr1616LoRAα\\alpha3232LoRA dropout0\.050\.05LoRA target modules\{q,k,v,o\}​\_\\\{q,k,v,o\\\}\\\_proj of every attention layerProjection headhϕh\_\{\\phi\}2\-layer MLP, hidden10241024, GELU activationHyperbolic embedding dimnn128128Up\-projectorgψg\_\{\\psi\}2\-layer MLP, hidden20482048, GELU activationUp\-projector output dimper\-backbone, matches the backbone’s residual hidden size \(51205120for Qwen2\.5\-14B\-Instruct\)Curvatureccinit1\.01\.0\(learnable scalar\)Stage 1: ranking\-supervised head trainingOptimiserAdamW,β1=0\.9,β2=0\.999,ϵ=10−8\\beta\_\{1\}\{=\}0\.9,\\ \\beta\_\{2\}\{=\}0\.999,\\ \\epsilon\{=\}10^\{\-8\}Learning rate1​e−31\\mathrm\{e\}\{\-3\}Weight decay0\.010\.01LR schedulecosine with linear warmupWarmup steps500500Epochs2020Effective batch size6464\(88GPUs×\\times88per\-GPU\)Ranking\-loss margin0\.10\.1Gradient clipping1\.01\.0Mixed precisionbf16Stage 1 Monte\-Carlo variant \(MATH only\)Number of rolloutsKK3232Rollout temperatureτmc\\tau\_\{\\mathrm\{mc\}\}0\.80\.8Rollout nucleuspp0\.950\.95Max tokens per rollout10241024Importance\-weight decayη\\eta0\.950\.95Stage 2: DAgger adapter fine\-tuningOptimiserAdamWLearning rate5​e−55\\mathrm\{e\}\{\-5\}\(LoRA parameters only\)Weight decay0\.00\.0LR schedulecosine, warmup ratio0\.030\.03DAgger epochs55Rollouts per training problem88Rollout temperatureτ\\tau0\.70\.7Nucleuspp0\.90\.9Max rollout tokens512512Oracle–policy mixingβ\\betaannealed1\.0→0\.01\.0\\to 0\.0across epochsEffective batch size6464\(1616per\-GPU×\\timesgrad\-accum44\)Mixed precisionbf16Inference \(HyperGuide and single\-pass baselines\)DecodinggreedyMax new tokens512512Re\-encode intervalevery step boundaryInference \(Tree of Thoughts baseline\)Search procedureBFSBeam widthbb55Search depthDD33Value\-prompt votes per candidate33
### B\.5Sensitivity Analysis

We report how the three most influential continuous hyperparameters affect accuracy on Game of 24 \(Group A\) and Rule\-chaining \(Group B\), varying one at a time while keeping all others at the defaults in Table[8](https://arxiv.org/html/2605.24140#A2.T8)\. The embedding dimensionnnis already analysed as part of the ablation study \(Section[4](https://arxiv.org/html/2605.24140#S4)\); we therefore focus here on the loss\-level hyperparameters\.

##### Ranking\-loss marginγ\\gamma\.

Table 9:Accuracy \(%\) as a function of the ranking\-loss marginγ\\gamma\(Equation[3](https://arxiv.org/html/2605.24140#S3.E3)\)\. Default value inbold\.γ\\gammaGame of 24Rule\-chaining0\.050\.0554780\.1\\mathbf\{0\.1\}57800\.20\.256790\.50\.55277Accuracy follows a flat inverted\-U around the default: too small a margin leaves many state pairs trivially satisfied and the radial axis under\-shaped, while too large a margin is unsatisfiable for many pairs and injects gradient noise\. Game of 24 is the more sensitive of the two tasks, consistent with its heavier reliance on a clean radial signal \(high dead\-end ratio in Table[7](https://arxiv.org/html/2605.24140#A2.T7)\)\.

##### Metric\-loss weightλ\\lambda\.

Table 10:Accuracy \(%\) as a function of the metric\-loss weightλ\\lambda\(Equation[2](https://arxiv.org/html/2605.24140#S3.E2)\)\. Default value inbold\.λ\\lambdaGame of 24Rule\-chaining0\.10\.15577\.50\.50\.557781\.0\\mathbf\{1\.0\}57802\.02\.05377The two endpoints fail asymmetrically: atλ=0\.1\\lambda\{=\}0\.1the structural axis is under\-trained and Rule\-chaining \(which leans on tree structure\) drops more than Game of 24, whereas atλ=2\.0\\lambda\{=\}2\.0the radial term is starved and Game of 24 drops more than Rule\-chaining\. This contrast shows thatℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}andℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}contribute distinct, non\-redundant supervision and motivates keeping both terms\.

##### Metric\-loss marginγ′\\gamma^\{\\prime\}\.

Table 11:Accuracy \(%\) as a function of the metric\-loss marginγ′\\gamma^\{\\prime\}\(Equation[4](https://arxiv.org/html/2605.24140#S3.E4)\)\. Default value inbold\.γ′\\gamma^\{\\prime\}Game of 24Rule\-chaining0\.050\.0554780\.1\\mathbf\{0\.1\}57800\.20\.256790\.50\.55377The metric margin shows the same flat inverted\-U asγ\\gamma, with comparable swing magnitudes; the parallel between the two confirms that both hinge losses behave well\-conditioned across roughly an order of magnitude around their defaults\.

## Appendix CTraining and Inference Details

### C\.1HyperGuide Implementation

All architecture and optimisation hyperparameters are listed in Table[8](https://arxiv.org/html/2605.24140#A2.T8)\. Training runs on8×8\\timesNVIDIA RTX A6000 GPUs; efficiency experiments use a single A6000 to measure single\-GPU inference cost\.

### C\.2Monte\-Carlo Head Training Details

This section provides the full specification of the Monte\-Carlo variant of Stage 1 used when the reasoning tree cannot be exhaustively enumerated \(Section[3\.2](https://arxiv.org/html/2605.24140#S3.SS2)\)\. In our experiments this variant is used exclusively for the MATH benchmark\.

##### Rollout procedure\.

For each training\-set problemxxwith ground\-truth answery⋆y^\{\\star\}, we sampleK=32K\{=\}32independent rollouts from the SFT\-merged base model \(i\.e\. the same backbone used throughout the pipeline, after task\-specific supervised fine\-tuning but before any LoRA or projection\-head training\) at temperatureτmc=0\.8\\tau\_\{\\mathrm\{mc\}\}\{=\}0\.8and nucleusp=0\.95p\{=\}0\.95, with a maximum budget of 1024 generated tokens per rollout\. A rollout is labelled*successful*if its final boxed expression matchesy⋆y^\{\\star\}under the same normalisation and SymPy equivalence check used at evaluation \(Appendix[B\.1](https://arxiv.org/html/2605.24140#A2.SS1)\)\.

##### Value estimate\.

The Monte\-Carlo value estimate at statessis

d^​\(s\)=1−1K​∑i=1K𝟙​\[ρi​reaches the goal from​s\],\\hat\{d\}\(s\)\\;=\\;1\-\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{K\}\\mathbb\{1\}\\bigl\[\\rho\_\{i\}\\text\{ reaches the goal from \}s\\bigr\],\(8\)so thatd^​\(s\)=0\\hat\{d\}\(s\)=0for states from which every rollout succeeds \(closest to a solution\) andd^​\(s\)=1\\hat\{d\}\(s\)=1for states from which none does \(dead end\)\. Althoughd^​\(s\)∈\[0,1\]\\hat\{d\}\(s\)\\in\[0,1\]lives on a different scale than the integer\-valued exactd​\(s\)d\(s\),ℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\(Equation[3](https://arxiv.org/html/2605.24140#S3.E3)\) only uses the order between distances, sod^​\(s\)\\hat\{d\}\(s\)is substituted in place ofd​\(s\)d\(s\)as a ranking signal\.

##### Trajectory\-local tree distances\.

Because the full tree is unavailable, pairwise tree distances forℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}\(Equation[4](https://arxiv.org/html/2605.24140#S3.E4)\) are extracted from rollout trajectories\. Two types of pairs are used:

1. 1\.Intra\-rollout pairs\.Consecutive states along a single rolloutρ\\rhosatisfyd𝒯​\(st,st\+k\)=kd\_\{\\mathcal\{T\}\}\(s\_\{t\},s\_\{t\+k\}\)=k\.
2. 2\.Inter\-rollout pairs\.Two rolloutsρ\(1\),ρ\(2\)\\rho^\{\(1\)\},\\rho^\{\(2\)\}that share a common prefix up to stepttand then diverge yieldd𝒯​\(st\+j\(1\),st\+k\(2\)\)=j\+kd\_\{\\mathcal\{T\}\}\\\!\\bigl\(s^\{\(1\)\}\_\{t\+j\},\\,s^\{\(2\)\}\_\{t\+k\}\\bigr\)=j\+k\.

##### Importance weighting\.

Trajectory\-local distances become less reliable as the offset from the shared prefix grows, because the inferred tree structure is based on a finite sample of rollouts rather than an exhaustive enumeration\. To down\-weight distant pairs, each triplet\(si,sj,sk\)\(s\_\{i\},s\_\{j\},s\_\{k\}\)inℒmetric\\mathcal\{L\}\_\{\\mathrm\{metric\}\}receives a multiplicative weight

w​\(si,sj,sk\)=ηd𝒯​\(si,sj\)\+d𝒯​\(si,sk\),w\(s\_\{i\},s\_\{j\},s\_\{k\}\)\\;=\\;\\eta^\{\\,d\_\{\\mathcal\{T\}\}\(s\_\{i\},\\,s\_\{j\}\)\\,\+\\,d\_\{\\mathcal\{T\}\}\(s\_\{i\},\\,s\_\{k\}\)\},\(9\)whereη=0\.95\\eta\{=\}0\.95is a fixed exponential decay factor\. This ensures that the loss is dominated by nearby, high\-confidence distance estimates while still receiving gradient from longer\-range pairs\.

##### Variance ofd^\\hat\{d\}\.

Since each rollout outcome is an independent Bernoulli trial with success probabilityp​\(s\)p\(s\), the variance ofd^​\(s\)\\hat\{d\}\(s\)isVar​\[d^​\(s\)\]=p​\(s\)​\(1−p​\(s\)\)/K\\mathrm\{Var\}\[\\hat\{d\}\(s\)\]=p\(s\)\(1\-p\(s\)\)/K\. AtK=32K\{=\}32the worst\-case standard deviation \(atp=0\.5p\{=\}0\.5\) is≈0\.088\\approx 0\.088, which is well below the ranking\-loss marginγ=0\.1\\gamma\{=\}0\.1used inℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\. For states that are clearly promising \(p≳0\.8p\\gtrsim 0\.8\) or clearly dead \(p≲0\.1p\\lesssim 0\.1\), the standard deviation drops below0\.050\.05, so the ranking loss receives a clean ordinal signal for the vast majority of training pairs\. We verified empirically that increasingKKfrom 32 to 64 on a held\-out subset of 500 MATH training problems changed fewer than 4% of pairwise ranking decisions inℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}, indicating thatK=32K\{=\}32is sufficient for stable head training\.

##### Hyperparameters\.

All Monte\-Carlo\-specific settings are listed in the dedicated sub\-section of Table[8](https://arxiv.org/html/2605.24140#A2.T8); the ranking\-loss margin is shared with the exact variant\.

### C\.3Baseline Implementations

This section documents the implementation details for every baseline reported in Section[4](https://arxiv.org/html/2605.24140#S4), in support of the fairness claims made there\. All baselines and HyperGuide share: \(i\) the same base model, Qwen2\.5\-14B\-Instruct; \(ii\) the same per\-task test split and per\-record identifier set; \(iii\) the same task\-specific scoring function; and \(iv\) the same hardware and numerical precision \(NVIDIA RTX A6000 GPUs, bfloat16 inference, 4\-bit NF4 quantization of the base when training adapters\)\. Only the inference algorithm and any baseline\-specific adapter weights differ across rows\.

##### Few\-shot Greedy\.

A single deterministic decode of the chat\-template\-wrapped task prompt, with a short task\-specific prefix that fixes the answer schema\. The decoding budget is384384generated tokens per problem\.

##### Self\-Consistency,K=5K\{=\}5\.

FollowingWanget al\.\[[2023](https://arxiv.org/html/2605.24140#bib.bib185)\], we draw five independent samples from the same prompt at temperature0\.70\.7and nucleusp=0\.95p\{=\}0\.95, parse the final answer of each sample with the task scorer, and report the majority answer\. Each sample uses the same384384\-token budget as Greedy\.

##### Tree\-of\-Thoughts \(BFS\)\.

We followYaoet al\.\[[2023](https://arxiv.org/html/2605.24140#bib.bib186)\]faithfully\. The base model serves both as the proposer and as the value model; a single set of weights is used for both roles\. At each search depth, every active trajectory is expanded with three sampled candidate next steps at temperature0\.70\.7; each candidate is scored three times by the value prompt, which asks the model to label the candidate as “sure”, “likely”, or “impossible” \(with token weights2020,11, and0\.0010\.001respectively, exactly as in the original work\)\. The five highest\-scoring trajectories are retained and the search continues to a maximum depth of twelve\. Each task uses a dedicated propose/value prompt pair so that the candidate format respects the task’s grammar \(Blocksworld actions, ProntoQA derivation steps, graph colorings, etc\.\)\. The Game\-of\-24 evaluation uses the verbatim four\-shot propose and value prompts from the original paper\. The total per\-problem generation budget is on the order of several hundred sampled completions, substantially exceeding the inference cost of HyperGuide; weak cells are therefore not attributable to budget\.

##### PT\-SFT \(Planning\-Token Supervised Fine\-Tuning\)\.

A re\-implementation of the planning\-token method ofWanget al\.\[[2024b](https://arxiv.org/html/2605.24140#bib.bib107)\]\. Each gold reasoning trajectory is annotated with a discrete operator tag immediately before each reasoning step, and a low\-rank adapter is fine\-tuned on the annotated trajectories with a completion\-only cross\-entropy objective \(loss is masked on prompt tokens\)\. Hyperparameters are held fixed across all eight tasks: rank1616adapters with scaling factor3232and dropout0\.050\.05, applied to every attention projection matrix; AdamW with learning rate1×10−41\{\\times\}10^\{\-4\}, cosine schedule with five percent warmup, gradient clipping at1\.01\.0, five training epochs, and gradient accumulation chosen so that the effective batch size is approximately1616\. The maximum sequence length is set per task to fit the longest gold trajectory \(384384–512512tokens for Game\-of\-24, ProntoQA, GraphColor, and N\-Queens;768768–10241024for ProofWriter, and Rule\-chain;15361536for Blocksworld\)\. Inference uses greedy decoding with the same prompt format used at training time, with400400generated tokens per problem\. We use the public training partition of each benchmark in full, without subsampling:7,6347\{,\}634examples for Game\-of\-24,2,0002\{,\}000for ProofWriter,6,0006\{,\}000for Rule\-chain,600600for Blocksworld,3,0003\{,\}000for ProntoQA,1,0001\{,\}000for GraphColor, and294294for N\-Queens\.

##### Note on PT\-SFT Blocksworld \(⋆\\star,96%96\\%\)\.

Our Blocksworld train and test splits are drawn from the same narrow distribution of44–55\-block configurations, all written under the same vocabulary and the same two\-shot prompt template\. Because the test split is generated from the same template with the same vocabulary, the fine\-tuned adapter can closely approximate the gold\-plan distribution at inference, and a non\-trivial fraction of greedy generations are minor variants of training trajectories\. This setup is therefore exceptionally memorization\-friendly and does not characterize how PT\-SFT generalizes on more compositional tasks\. We mark the cell with⋆\\starso the reader can discount it when judging the broader pattern\.

##### Outcome Value Model \(OVM\)\.

A re\-implementation ofYuet al\.\[[2024](https://arxiv.org/html/2605.24140#bib.bib28)\]on top of the PT\-SFT generator\. We freeze both the base model and the PT\-SFT adapter, and train only a scalar value head \(a small two\-layer MLP applied to the final hidden state\) with a per\-token mean\-squared\-error loss against the trajectory’s binary outcome label\. Training rollouts are sampled from the frozen generator at temperature1\.01\.0and nucleusp=0\.95p\{=\}0\.95,256256tokens per rollout, with forty rollouts per training problem, yielding between five and ten thousand labeled trajectories per task\. The value head is trained for two epochs with the same optimizer settings as PT\-SFT\. For Game\-of\-24, where only about six percent of rollouts are correct, we reweight the positive class by a factor of fifteen to balance the loss; other tasks use uniform weighting\.

At inference, OVM performs step\-level value\-guided beam search\. At each step, twenty single\-line continuations are sampled at temperature1\.01\.0from each beam, every candidate is scored by the value head \(computed at the candidate’s last token, with left\-padding to align positions across the batch\), and the five highest\-valued candidates are retained\. The search runs for at most ten steps, with sixty\-four tokens per step\. The total per\-problem generation budget is on the order of one thousand candidate steps, which again exceeds HyperGuide’s inference cost\. OVM is reported on every dataset and backbone in Tables[2](https://arxiv.org/html/2605.24140#S4.T2)and[3](https://arxiv.org/html/2605.24140#S4.T3)\.

##### SoftCoT\.

A re\-implementation ofXuet al\.\[[2025](https://arxiv.org/html/2605.24140#bib.bib22)\]\. A small Qwen2\.5\-1\.5B\-Instruct assistant emits a sequence of hidden states; a trainable linear projection maps these states into the embedding space of the base model, where they are spliced into the prompt immediately before generation\. Both language models are frozen and only the projection is trained\. We train the projection on a4,0004\{,\}000\-record subset of GSM8K for two epochs \(learning rate2×10−42\{\\times\}10^\{\-4\}, batch size11, gradient accumulation1616,3232thoughts during training,44at evaluation\)\. Because the original paper trains and evaluates on the same dataset, our cross\-domain \(GSM8K\-trained, MATH\-500\-evaluated\) and cross\-family \(Qwen\-trained projection applied to gpt\-oss and Mistral bases\) settings should be read as a stress test of transfer, not as a faithful reproduction\.

##### Hyperparameter tuning\.

We did not perform per\-task hyperparameter sweeps for any baseline beyond the choices documented above\. We believe the most likely cause of the remaining weak cells is method–task fit \(for example, the value model in ToT failing to discriminate well on ProofWriter true/false derivations\) rather than under\-tuning, but we leave a more exhaustive sweep to future work\.

##### Per\-call token budget\.

Inference\-time generation caps differ slightly across methods \(Few\-shot and Self\-Consistency:384384; PT\-SFT:400400; HyperGuide:512512\), each set to comfortably exceed the longest answer the method needs to emit: the128128\-token margin granted to HyperGuide functions as projection\-head headroom for the per\-step boundary read rather than as extra answer length, and across the seven tabular benchmarks the gold answer fits well within the384384\-token Few\-shot cap\. SoftCoT vs\. Few\-shot on MATH is the one comparison for which we explicitly match budgets, since soft\-token injection systematically extends generation length and matched\-budget evaluation isolates the algorithmic effect from the length effect\.

## Appendix DQualitative Decision Examples

Table[12](https://arxiv.org/html/2605.24140#A4.T12)traces a single Game\-of\-24 problem through four representative step\-boundary states, contrasting the next\-operation distribution with and without the injected geometric signal\.

Table 12:Single\-step decision snapshots from one Game\-of\-24 problem\[4,4,6,8\]→24\[4,4,6,8\]\\to 24\. All four rows come from the same problem tree:v=2v\{=\}2follows the gold step4\+8=124\{\+\}8\{=\}12,v=1v\{=\}1follows the gold step6−4=26\{\-\}4\{=\}2, and the dead\-end row is the state\[4,8,24\]\[4,8,24\]reached by the off\-tree choice4×6=244\{\\times\}6\{=\}24at the initial state \(which leaves44,88,2424unreachable\)\. Within this single tree, smallerd​\(0,z\)d\(0,z\)corresponds to states closer to a solution; the dead\-end branch carries the largestd​\(0,z\)d\(0,z\)as the head signals an unreachable state\.⋆\\starmarks the oracle\-correct op in each top\-3\.\[0pt\]BucketState𝒅​\(𝟎,𝒛\)\\boldsymbol\{d\(0,z\)\}Top\-3 withoutzzTop\-3 withzzOracle\-correct\[0pt\]v=3v\{=\}3Remaining:4,4,6,84,4,6,8\(target2424\)11\.5011\.508\+6=148\+6\{=\}14p=0\.31p\{=\}0\.314\+4=84\+4\{=\}8p=0\.24p\{=\}0\.24⋆4\+8=12\\star\\,4\+8\{=\}12p=0\.19p\{=\}0\.19⋆4\+8=12\\star\\,4\+8\{=\}12p=0\.62p\{=\}0\.628\+6=148\+6\{=\}14p=0\.12p\{=\}0\.124\+4=84\+4\{=\}8p=0\.08p\{=\}0\.084\+8=124\{\+\}8\{=\}12\[0pt\]v=2v\{=\}2Remaining:4,6,124,6,12\(target2424\)9\.509\.506\+12=186\+12\{=\}18p=0\.30p\{=\}0\.3012−4=812\-4\{=\}8p=0\.24p\{=\}0\.24⋆6−4=2\\star\\,6\-4\{=\}2p=0\.18p\{=\}0\.18⋆6−4=2\\star\\,6\-4\{=\}2p=0\.59p\{=\}0\.596\+12=186\+12\{=\}18p=0\.13p\{=\}0\.1312−4=812\-4\{=\}8p=0\.10p\{=\}0\.106−4=26\{\-\}4\{=\}2\[0pt\]v=1v\{=\}1Remaining:2,122,12\(target2424\)7\.507\.5012\+2=1412\+2\{=\}14p=0\.30p\{=\}0\.3012−2=1012\-2\{=\}10p=0\.24p\{=\}0\.24⋆2×12=24\\star\\,2\\times 12\{=\}24p=0\.21p\{=\}0\.21⋆2×12=24\\star\\,2\\times 12\{=\}24p=0\.65p\{=\}0\.6512\+2=1412\+2\{=\}14p=0\.10p\{=\}0\.1012−2=1012\-2\{=\}10p=0\.08p\{=\}0\.082×12=242\{\\times\}12\{=\}24\[0pt\] dead\-endRemaining:4,8,244,8,24\(target2424\)12\.3012\.3024−8=1624\-8\{=\}16p=0\.34p\{=\}0\.3424−4=2024\-4\{=\}20p=0\.27p\{=\}0\.2724\+4=2824\+4\{=\}28p=0\.20p\{=\}0\.2024−8=1624\-8\{=\}16p=0\.20p\{=\}0\.2024−4=2024\-4\{=\}20p=0\.18p\{=\}0\.1824\+4=2824\+4\{=\}28p=0\.16p\{=\}0\.16*\(unreachable\)*
## Appendix ELimitations

##### Bounded scale range\.

Our experiments span three open\-weight backbones from different families \(Qwen2\.5\-14B\-Instruct, GPT\-OSS\-20B, Mistral\-Small\-3\.2\-24B\) on NVIDIA RTX A6000 GPUs, so the gains are demonstrated to be robust across base models and hardware within the1414B–2424B dense\-decoder regime rather than tied to a single backbone\. We do not, however, characterise scaling behaviour outside this band: smaller open\-weight models, much larger \(≥70\\geq 70B\) backbones, and mixture\-of\-experts architectures are unexplored, and we therefore make no claim about a scaling law beyond the regime we cover\.

##### Minimal architectural search\.

We use a single learnable scalar curvature, a two\-layer projection head, and LoRA rank1616uniformly across attention modules\. Per\-layer curvature, larger ranks, and alternatives to the Poincaré ball \(e\.g\., the Lorentz model or product\-of\-curvatures geometries\) are unexplored, so the reported numbers likely lower\-bound rather than upper\-bound what the geometric inductive bias can deliver\.

##### Reasoning regime\.

Our out\-of\-domain evaluation covers two motif families with three transfer tasks each\. Tasks whose solution structure is not naturally tree\-shaped, such as long\-horizon dialogue or retrieval\-heavy QA, lie outside the regime our analysis covers; we make no claim about transfer there\.

Similar Articles

Hint-Guided Diversified Policy Optimization for LLM Reasoning

arXiv cs.CL

This paper introduces Hint-Guided Diversified Policy Optimization (HDPO), a two-stage RL framework that encourages LLMs to first generate multiple candidate solution outlines (hints) and then select the most reliable one for detailed reasoning, improving reasoning diversity and reliability.

Enhanced and Efficient Reasoning in Large Learning Models

arXiv cs.AI

This paper proposes a method for improving reasoning in large language models by recoding data to explicitly represent relationships, enabling efficient principled reasoning with polynomial-time learnability for relational rules, which addresses hallucinations and supports sound reasoning across multiple calls.