Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

arXiv cs.CL Papers

Summary

Introduces Thoughts-as-Planning, a framework that models chain-of-thought optimization as sequential decision-making using latent world models and reinforcement learning, outperforming existing methods in efficiency and generalization.

arXiv:2605.28842v1 Announce Type: new Abstract: The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce \textbf{Thoughts-as-Planning}, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:13 AM

# 1 Introduction
Source: [https://arxiv.org/html/2605.28842](https://arxiv.org/html/2605.28842)
Thoughts\-as\-Planning: Latent World Models for Chain\-of\-Thoughts Optimization via Reinforcement Planning

Dong LiuYanxuan YuYing Nian WuUniversity of California, Los Angelespikeliu@ucla\.eduColumbia Universityyy3523@columbia\.eduUniversity of California, Los Angelesywu@stat\.ucla\.edu

###### Abstract

The success of large language models \(LLMs\) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives\. Existing reasoning chain tuning methods often rely on black\-box heuristics or gradient\-free search, which lack interpretability, generalization, and sample efficiency\. In this work, we introduceThoughts\-as\-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision\-making process over a latent semantic space\. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs\. A proximity\-preserving embedding space is constructed to encode reasoning chain\-response dynamics, enabling planning via gradient descent or reinforcement learning\. Our method supports multi\-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner\. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts\-as\-Planning outperforms state\-of\-the\-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory\. Our code is available at[https://github\.com/FastLM/Thoughts\-as\-Planning](https://github.com/FastLM/Thoughts-as-Planning)\.

Large language models \(LLMs\) have demonstrated remarkable capabilities across a wide range of tasks, including reasoning, summarization, and dialogue generation\. Chain\-of\-thoughts \(CoT\) reasoning has emerged as a powerful technique for enhancing LLM reasoning capabilities by enabling step\-by\-step thinking processesWeiet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib4)\)\. However, the effectiveness of CoT reasoning is highly sensitive to the structure and quality of reasoning chains—the sequential thought steps that guide the model’s reasoning process\. Minor changes in reasoning step formulation, logical flow, or intermediate conclusions can lead to substantial variations in performanceWeiet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib4)\); Kojimaet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib5)\)\. As a result,*chain\-of\-thoughts optimization*has emerged as a central challenge in enhancing LLM reasoning capabilities\.

Despite its importance, most CoT engineering today remains manual, heuristic, or reliant on black\-box optimization techniques\. These approaches suffer from low data efficiency, poor generalization to new reasoning tasks or domains, and limited interpretability\. In response, recent work has explored*automated CoT generation*Zhanget al\.\([2022b](https://arxiv.org/html/2605.28842#bib.bib6)\)or*reasoning chain search*Liuet al\.\([2023](https://arxiv.org/html/2605.28842#bib.bib23)\), but these still operate under static optimization regimes, lacking the structure and dynamics inherent in multi\-step reasoning and planning\.

In this work, we propose a new perspective: we view chain\-of\-thoughts optimization as an instance ofplanning, where the goal is to iteratively refine reasoning chains in order to maximize reasoning performance\. Drawing inspiration from latent world modeling and model\-based reinforcement learning, we introduceThoughts\-as\-Planning, a framework that integrates learned latent dynamics, proximity\-based representations, and multi\-scale reasoning chain control into a unified planning pipeline\.

At the core of our method is a*latent world model*T^θ\\hat\{T\}\_\{\\theta\}that simulates how an LLM would respond to modified reasoning chains\. We embed both reasoning chains and outputs into a proximity\-preserving space, where planning can be performed via similarity\-based objectives\. To enable generalization and efficient exploration, we model reasoning chains as sequences of discrete editing actions, and learn a planning policy over this structured space using either gradient search or reinforcement learning\. Our framework naturally supports multi\-scale abstraction, allowing reasoning chain edits to occur at the token, reasoning\-step, or structural level\.

We evaluate Thoughts\-as\-Planning across a suite of mathematical reasoning, commonsense reasoning, and logical inference benchmarks\. Our results show that it consistently outperforms existing CoT optimization methods in both performance and data efficiency, and yields interpretable reasoning chain trajectories that can be reused or adapted across reasoning tasks\.

##### Contributions\.

Our main contributions are:

- •We proposeThoughts\-as\-Planning, a novel framework that formalizes chain\-of\-thoughts optimization as latent\-space planning via a learned world model\.
- •We develop amulti\-scale reasoning chain control policythat enables token\-level to structural edits under a unified framework\.
- •We providetheoretical analysisand detailed mathematical proofs for the convergence and optimality properties of our planning framework\.
- •We demonstrate through extensive experiments that our method improves efficiency, robustness, and generalization compared to existing baselines\.

### 1\.1Chain\-of\-Thoughts Optimization and Reasoning Enhancement

Chain\-of\-thoughts \(CoT\) optimization plays a central role in enhancing LLM reasoning capabilities without full fine\-tuning\. The seminal work by Wei et al\.Weiet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib4)\)introduced CoT reasoning, demonstrating that step\-by\-step reasoning significantly improves performance on complex reasoning tasks\. Subsequent research has explored various aspects of CoT optimization, including automatic CoT generationZhanget al\.\([2022b](https://arxiv.org/html/2605.28842#bib.bib6)\), few\-shot CoT learningKojimaet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib5)\), and CoT distillationFuet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib7)\)\. However, CoT sensitivity to reasoning chain structureWeiet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib4)\); Kojimaet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib5)\)has revealed the need for automated optimization of reasoning steps\.

Recent approaches include discrete methods such as AutoCoTZhanget al\.\([2022b](https://arxiv.org/html/2605.28842#bib.bib6)\), which automatically generates reasoning chains, and CoT optimization frameworksLiuet al\.\([2023](https://arxiv.org/html/2605.28842#bib.bib23)\)that search the discrete reasoning space via heuristic or evolutionary approaches\. On the other hand, soft reasoning embeddingsLesteret al\.\([2021](https://arxiv.org/html/2605.28842#bib.bib21)\); Li and Liang \([2021](https://arxiv.org/html/2605.28842#bib.bib22)\)learn continuous representations of reasoning patterns, but often lack interpretability\. Recent reasoning synthesis frameworksWanget al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib26)\)construct reasoning chains from demonstration banks\. Unlike these static or gradient\-free schemes, our work frames reasoning chain editing as a dynamic decision process over a latent model, allowing structured planning with sample efficiency and interpretability\.

### 1\.2Latent World Models and Abstract Reasoning

Latent world models have gained traction in reinforcement learning and abstract reasoning for their ability to simulate environment dynamics without direct observation\. DreamerHafneret al\.\([2019](https://arxiv.org/html/2605.28842#bib.bib14),[2020](https://arxiv.org/html/2605.28842#bib.bib28)\)and World ModelsHa and Schmidhuber \([2018](https://arxiv.org/html/2605.28842#bib.bib27)\)encode state transitions into compact latent spaces to enable efficient planning\. Inspired by these methods, our latent world model predicts the effect of reasoning chain edits on downstream reasoning outcomes, decoupling simulation from LLM querying\.

Recent advances in language model reasoning have introduced latent\-space abstraction mechanisms\. Latent Thought Language ModelsKonget al\.\([2025](https://arxiv.org/html/2605.28842#bib.bib12)\); Hoffmanet al\.\([2023](https://arxiv.org/html/2605.28842#bib.bib13)\)propose structured latent variables to encode intermediate thoughts in chain\-of\-thought reasoning\. These models learn to represent reasoning steps in a continuous latent space, enabling more efficient reasoning chain optimization\. LTM methods such asPlace CellsZhaoet al\.\([2025](https://arxiv.org/html/2605.28842#bib.bib1)\)model positional embeddings through proximity\-preserving transition kernels for navigation and path planning\. Others extend this framework to embodied agentsNohet al\.\([2025](https://arxiv.org/html/2605.28842#bib.bib2)\)and test\-time adaptive reasoning via latent policy gradientsLiet al\.\([2025](https://arxiv.org/html/2605.28842#bib.bib3)\)\. These works inspire our construction of a multi\-scale latent proximity space for reasoning chain planning\.

### 1\.3Reinforcement Learning for Reasoning Chain and Program Synthesis

The use of reinforcement learning in chain\-of\-thoughts optimization and LLM reasoning enhancement has received increasing attention\. RLCoTDenget al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib16)\)and PPO\-tuned reasoning modelsKorbaket al\.\([2023](https://arxiv.org/html/2605.28842#bib.bib17)\)formulate reasoning chain optimization as an MDP or preference\-based reward learning problem\. DPORafailovet al\.\([2024](https://arxiv.org/html/2605.28842#bib.bib15)\)directly optimizes reasoning preferences without explicit reward modeling\. However, most existing works treat LLMs as black\-box environments and do not learn explicit transition dynamics for reasoning chains\. Our method instead learns a latent dynamics model and performs planning through it, enabling faster convergence and better generalization\. Other relevant works such as RE3Yanget al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib29)\)explore in\-context RL with heuristic exploration, whereas our method offers learned, model\-based planning over structured reasoning chain programs\.

##### Related work by the authors\.

In parallel lines of inquiry, we have studied efficient LLM inference and compressionLiu \([2024](https://arxiv.org/html/2605.28842#bib.bib44)\); Liu and Yu \([2024](https://arxiv.org/html/2605.28842#bib.bib46)\); Liuet al\.\([2025a](https://arxiv.org/html/2605.28842#bib.bib48)\), serving and KV\-cache systemsLiu and Yu \([2025c](https://arxiv.org/html/2605.28842#bib.bib52),[2026](https://arxiv.org/html/2605.28842#bib.bib51)\), long\-context modeling and scalable text semanticsLiuet al\.\([2026](https://arxiv.org/html/2605.28842#bib.bib47)\); Liu and Yu \([2025a](https://arxiv.org/html/2605.28842#bib.bib50)\), adaptive multi\-task trainingLiu and Yu \([2025b](https://arxiv.org/html/2605.28842#bib.bib45)\), and experience\-driven planning for reinforcement learningLiuet al\.\([2025b](https://arxiv.org/html/2605.28842#bib.bib49)\)\. These efforts are complementary: they focus on throughput, memory, and training or RL mechanics, whereas this paper targets interpretable chain\-of\-thoughts optimization via latent world models and multi\-scale planning\.

### 1\.4Chain\-of\-Thoughts Research Landscape and Our Contributions

The chain\-of\-thoughts paradigm has evolved significantly since its introduction by Wei et al\.Weiet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib4)\)\. Early work focused on demonstrating the effectiveness of step\-by\-step reasoning in improving LLM performance on complex reasoning tasks\. Kojima et al\.Kojimaet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib5)\)showed that even zero\-shot reasoning can be elicited through simple reasoning strategies, while Zhang et al\.Zhanget al\.\([2022b](https://arxiv.org/html/2605.28842#bib.bib6)\)explored automatic generation of reasoning chains\.

Recent advances have introduced more sophisticated approaches to reasoning chain optimization\. Fu et al\.Fuet al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib7)\)proposed complexity\-based reasoning that adapts reasoning strategies based on problem difficulty\. Latent Thought Language ModelsKonget al\.\([2025](https://arxiv.org/html/2605.28842#bib.bib12)\); Hoffmanet al\.\([2023](https://arxiv.org/html/2605.28842#bib.bib13)\)have introduced structured latent representations for reasoning steps, enabling more efficient reasoning chain manipulation\.

Our work makes several key contributions to this landscape:

- •Latent World Modeling for Reasoning Chains: Unlike previous work that treats reasoning chains as static templates, we model the reasoning process as a dynamic system with learnable transition dynamics\. This enables us to predict the effect of reasoning chain modifications without extensive trial\-and\-error\.
- •Multi\-Scale Reasoning Chain Editing: We introduce a unified framework for editing reasoning chains at multiple granularities—from token\-level modifications to structural changes in logical flow\. This allows for more nuanced optimization of reasoning strategies\.
- •Planning\-Based Optimization: We formalize reasoning chain optimization as a planning problem, enabling systematic exploration of the reasoning space rather than relying on heuristic search or random sampling\.
- •Transferable Reasoning Patterns: Our latent representations capture reusable reasoning patterns that can be transferred across different reasoning tasks, enabling more efficient adaptation to new domains\.

These contributions address key limitations in existing CoT optimization approaches, particularly the lack of interpretability, poor generalization, and inefficient exploration strategies\. Our method provides a principled framework for reasoning chain optimization that scales to complex reasoning tasks while maintaining interpretability and transferability\.

## 2Method

We presentThoughts\-as\-Planning, a framework that formulates chain\-of\-thoughts optimization as a sequential decision\-making process over a learned latent world model\. Our system consists of four key components: \(1\) a reasoning chain state encoderhϕ:𝒮→ℝdh\_\{\\phi\}:\\mathcal\{S\}\\rightarrow\\mathbb\{R\}^\{d\}, \(2\) a latent transition modelT^θ:ℝd×𝒜→ℝd\\hat\{T\}\_\{\\theta\}:\\mathbb\{R\}^\{d\}\\times\\mathcal\{A\}\\rightarrow\\mathbb\{R\}^\{d\}, \(3\) a utility predictorR^ψ:ℝd→ℝ\\hat\{R\}\_\{\\psi\}:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}, and \(4\) a planning module for multi\-step reasoning chain editing\.

### 2\.1Problem Formulation

Let𝒳\\mathcal\{X\}denote the space of reasoning task inputs and𝒞\\mathcal\{C\}denote the space of reasoning chains\. For a given reasoning taskx∈𝒳x\\in\\mathcal\{X\}, we iteratively refine a reasoning chainct∈𝒞c\_\{t\}\\in\\mathcal\{C\}overTToptimization steps to produce a final versioncTc\_\{T\}that maximizes the expected rewardR​\(x,cT\)R\(x,c\_\{T\}\)derived from downstream reasoning performance\.

We formalize chain\-of\-thoughts optimization as a Markov Decision Process \(MDP\)\(𝒮,𝒜,𝒫,ℛ,γ\)\(\\mathcal\{S\},\\mathcal\{A\},\\mathcal\{P\},\\mathcal\{R\},\\gamma\)where:

- •𝒮=𝒳×𝒞\\mathcal\{S\}=\\mathcal\{X\}\\times\\mathcal\{C\}is the state space, with statesst=\(x,ct\)s\_\{t\}=\(x,c\_\{t\}\)
- •𝒜\\mathcal\{A\}is the action space of reasoning chain edit operations
- •𝒫​\(st\+1\|st,at\)\\mathcal\{P\}\(s\_\{t\+1\}\|s\_\{t\},a\_\{t\}\)is the transition dynamics \(unknown, learned via latent model\)
- •ℛ​\(st,at\)=R​\(x,ct\+1\)−R​\(x,ct\)\\mathcal\{R\}\(s\_\{t\},a\_\{t\}\)=R\(x,c\_\{t\+1\}\)\-R\(x,c\_\{t\}\)is the reward function
- •γ∈\[0,1\)\\gamma\\in\[0,1\)is the discount factor

The objective is to find an optimal policyπ∗:𝒮→𝒜\\pi^\{\*\}:\\mathcal\{S\}\\rightarrow\\mathcal\{A\}that maximizes the expected cumulative reward:

π∗=arg⁡maxπ⁡𝔼π​\[∑t=0T−1γt​ℛ​\(st,at\)∣s0=\(x,c0\)\]\\pi^\{\*\}=\\arg\\max\_\{\\pi\}\\mathbb\{E\}\_\{\\pi\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\mathcal\{R\}\(s\_\{t\},a\_\{t\}\)\\mid s\_\{0\}=\(x,c\_\{0\}\)\\right\]\(1\)

### 2\.2Latent World Model Architecture

#### 2\.2\.1State Encoder

We encode reasoning chain states into add\-dimensional latent space using a transformer\-based encoderhϕ:𝒮→ℝdh\_\{\\phi\}:\\mathcal\{S\}\\rightarrow\\mathbb\{R\}^\{d\}:

𝐳t=hϕ​\(st\)=Transformerϕ​\(\[x;ct\]\)\\mathbf\{z\}\_\{t\}=h\_\{\\phi\}\(s\_\{t\}\)=\\text\{Transformer\}\_\{\\phi\}\(\[x;c\_\{t\}\]\)\(2\)
where\[x;ct\]\[x;c\_\{t\}\]denotes the concatenation of task inputxxand reasoning chainctc\_\{t\}, andTransformerϕ\\text\{Transformer\}\_\{\\phi\}is a multi\-layer transformer with learnable parametersϕ\\phi\.

#### 2\.2\.2Transition Model

The latent transition modelT^θ:ℝd×𝒜→ℝd\\hat\{T\}\_\{\\theta\}:\\mathbb\{R\}^\{d\}\\times\\mathcal\{A\}\\rightarrow\\mathbb\{R\}^\{d\}predicts future latent states:

𝐳t\+1=T^θ​\(𝐳t,at\)=𝐳t\+MLPθ​\(𝐳t,embed​\(at\)\)\\mathbf\{z\}\_\{t\+1\}=\\hat\{T\}\_\{\\theta\}\(\\mathbf\{z\}\_\{t\},a\_\{t\}\)=\\mathbf\{z\}\_\{t\}\+\\text\{MLP\}\_\{\\theta\}\(\\mathbf\{z\}\_\{t\},\\text\{embed\}\(a\_\{t\}\)\)\(3\)
whereembed​\(at\)\\text\{embed\}\(a\_\{t\}\)maps discrete edit actions to continuous embeddings andMLPθ\\text\{MLP\}\_\{\\theta\}is a multi\-layer perceptron with residual connections\.

The transition model is trained to minimize the prediction error:

ℒdyn​\(θ\)=𝔼\(st,at,st\+1\)∼𝒟​‖hϕ​\(st\+1\)−T^θ​\(hϕ​\(st\),at\)‖22\\mathcal\{L\}\_\{\\text\{dyn\}\}\(\\theta\)=\\mathbb\{E\}\_\{\(s\_\{t\},a\_\{t\},s\_\{t\+1\}\)\\sim\\mathcal\{D\}\}\\left\\\|h\_\{\\phi\}\(s\_\{t\+1\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\_\{t\}\),a\_\{t\}\)\\right\\\|\_\{2\}^\{2\}\(4\)
where𝒟\\mathcal\{D\}is the dataset of state transitions collected from LLM interactions\.

#### 2\.2\.3Utility Predictor

The utility predictorR^ψ:ℝd→ℝ\\hat\{R\}\_\{\\psi\}:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}estimates the expected reward for latent states:

R^ψ​\(𝐳t\)=MLPψ​\(𝐳t\)\\hat\{R\}\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)=\\text\{MLP\}\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)\(5\)
trained with the objective:

ℒrew​\(ψ\)=𝔼\(st,R​\(x,ct\)\)∼𝒟​‖R^ψ​\(hϕ​\(st\)\)−R​\(x,ct\)‖22\\mathcal\{L\}\_\{\\text\{rew\}\}\(\\psi\)=\\mathbb\{E\}\_\{\(s\_\{t\},R\(x,c\_\{t\}\)\)\\sim\\mathcal\{D\}\}\\left\\\|\\hat\{R\}\_\{\\psi\}\(h\_\{\\phi\}\(s\_\{t\}\)\)\-R\(x,c\_\{t\}\)\\right\\\|\_\{2\}^\{2\}\(6\)

### 2\.3Multi\-Scale Action Space

We define edit actions at multiple granularities to enable hierarchical reasoning chain optimization:

𝒜=𝒜token∪𝒜step∪𝒜structure\\mathcal\{A\}=\\mathcal\{A\}\_\{\\text\{token\}\}\\cup\\mathcal\{A\}\_\{\\text\{step\}\}\\cup\\mathcal\{A\}\_\{\\text\{structure\}\}\(7\)
where:

- •𝒜token=\{add,delete,replace\}×𝒱×ℕ\\mathcal\{A\}\_\{\\text\{token\}\}=\\\{\\text\{add\},\\text\{delete\},\\text\{replace\}\\\}\\times\\mathcal\{V\}\\times\\mathbb\{N\}\(token\-level edits\)
- •𝒜step=\{reorder,split,merge\}×ℕ×ℕ\\mathcal\{A\}\_\{\\text\{step\}\}=\\\{\\text\{reorder\},\\text\{split\},\\text\{merge\}\\\}\\times\\mathbb\{N\}\\times\\mathbb\{N\}\(step\-level edits\)
- •𝒜structure=𝒪str×𝒞\\mathcal\{A\}\_\{\\text\{structure\}\}=\\mathcal\{O\}\_\{\\text\{str\}\}\\times\\mathcal\{C\},𝒪str:=\{oex,oins,ofmt\}\\mathcal\{O\}\_\{\\text\{str\}\}:=\\\{o\_\{\\text\{ex\}\},o\_\{\\text\{ins\}\},o\_\{\\text\{fmt\}\}\\\}\(structural\)

Hereoex,oins,ofmto\_\{\\text\{ex\}\},o\_\{\\text\{ins\}\},o\_\{\\text\{fmt\}\}denote add\-example, instruction\-edit, and format\-change operators acting on𝒞\\mathcal\{C\}\.

Each actionat=\(scale,type,target\)a\_\{t\}=\(\\text\{scale\},\\text\{type\},\\text\{target\}\)is parameterized by its granularity, operation type, and target location/content\.

### 2\.4Planning Algorithm

#### 2\.4\.1Model\-Based Planning

At each optimization steptt, we generateKKcandidate edit actions\{at\(k\)\}k=1K\\\{a\_\{t\}^\{\(k\)\}\\\}\_\{k=1\}^\{K\}and simulateHH\-step rollouts using the learned world model:

𝐳t\+H\(k\)=T^θH​\(𝐳t,at\(k\)\)=T^θ​\(⋯​T^θ​\(𝐳t,at\(k\)\),at\+H−1\(k\)\)\\mathbf\{z\}\_\{t\+H\}^\{\(k\)\}=\\hat\{T\}\_\{\\theta\}^\{H\}\(\\mathbf\{z\}\_\{t\},a\_\{t\}^\{\(k\)\}\)=\\hat\{T\}\_\{\\theta\}\(\\cdots\\hat\{T\}\_\{\\theta\}\(\\mathbf\{z\}\_\{t\},a\_\{t\}^\{\(k\)\}\),a\_\{t\+H\-1\}^\{\(k\)\}\)\(8\)
whereT^θH\\hat\{T\}\_\{\\theta\}^\{H\}denotes recursive application of the transition model\.

#### 2\.4\.2Action Selection

We score each candidate trajectory using the utility predictor and select the action with highest predicted reward:

R^ψ\(k\)=R^ψ​\(𝐳t\+H\(k\)\),at∗=arg⁡maxk∈\[K\]⁡R^ψ\(k\)\\hat\{R\}\_\{\\psi\}^\{\(k\)\}=\\hat\{R\}\_\{\\psi\}\(\\mathbf\{z\}\_\{t\+H\}^\{\(k\)\}\),\\quad a\_\{t\}^\{\*\}=\\arg\\max\_\{k\\in\[K\]\}\\hat\{R\}\_\{\\psi\}^\{\(k\)\}\(9\)
For exploration, we use softmax sampling with temperatureτ\\tau:

π​\(at\(k\)∣st\)=exp⁡\(R^ψ\(k\)/τ\)∑j=1Kexp⁡\(R^ψ\(j\)/τ\)\\pi\(a\_\{t\}^\{\(k\)\}\\mid s\_\{t\}\)=\\frac\{\\exp\(\\hat\{R\}\_\{\\psi\}^\{\(k\)\}/\\tau\)\}\{\\sum\_\{j=1\}^\{K\}\\exp\(\\hat\{R\}\_\{\\psi\}^\{\(j\)\}/\\tau\)\}\(10\)

### 2\.5Training Procedure

The complete training objective combines dynamics and reward prediction losses:

ℒtotal​\(ϕ,θ,ψ\)=λdyn⋅ℒdyn​\(θ\)\+λrew⋅ℒrew​\(ψ\)\\mathcal\{L\}\_\{\\text\{total\}\}\(\\phi,\\theta,\\psi\)=\\lambda\_\{\\text\{dyn\}\}\\cdot\\mathcal\{L\}\_\{\\text\{dyn\}\}\(\\theta\)\+\\lambda\_\{\\text\{rew\}\}\\cdot\\mathcal\{L\}\_\{\\text\{rew\}\}\(\\psi\)\(11\)
whereλdyn\\lambda\_\{\\text\{dyn\}\}andλrew\\lambda\_\{\\text\{rew\}\}are hyperparameters balancing the two objectives\.

### 2\.6Theoretical Guarantees

###### Theorem 1\(Convergence of Latent World Model\)\.

Under standard regularity conditions, the learned transition modelT^θ\\hat\{T\}\_\{\\theta\}converges to the true transition dynamics𝒫\\mathcal\{P\}with rateO​\(1/n\)O\(1/\\sqrt\{n\}\)wherennis the number of training samples\.

###### Theorem 2\(Planning Optimality\)\.

The planning algorithm achievesϵ\\epsilon\-optimal performance with sample complexityO​\(H2​log⁡\(1/ϵ\)\)O\(H^\{2\}\\log\(1/\\epsilon\)\)when the world model is sufficiently accurate\.

###### Theorem 3\(Multi\-Scale Convergence\)\.

The multi\-scale action space enables faster convergence compared to single\-scale editing, with improvement factor proportional to the number of abstraction levels\.

Detailed proofs of these theorems are provided in Appendix[A](https://arxiv.org/html/2605.28842#A1)\.

Input :Task input

xx, initial reasoning chain

c0c\_\{0\}, world model

T^θ\\hat\{T\}\_\{\\theta\}, encoder

hϕh\_\{\\phi\}, reward estimator

R^ψ\\hat\{R\}\_\{\\psi\}, planning horizon

HH, optimization steps

TT
Output :Optimized reasoning chain

cTc\_\{T\}
1Initialize

s0←\(x,c0\)s\_\{0\}\\leftarrow\(x,c\_\{0\}\),

𝐳0←hϕ​\(s0\)\\mathbf\{z\}\_\{0\}\\leftarrow h\_\{\\phi\}\(s\_\{0\}\)
2for*t=0t=0toT−1T\-1*do

3Sample

KKcandidate edit actions

\{at\(k\)\}k=1K\\\{a\_\{t\}^\{\(k\)\}\\\}\_\{k=1\}^\{K\}from

𝒜\\mathcal\{A\}
4for*k=1k=1toKK*do

5Simulate rollout:

𝐳t\+H\(k\)←T^θH​\(𝐳t,at\(k\)\)\\mathbf\{z\}\_\{t\+H\}^\{\(k\)\}\\leftarrow\\hat\{T\}\_\{\\theta\}^\{H\}\(\\mathbf\{z\}\_\{t\},a\_\{t\}^\{\(k\)\}\)
6Predict reward:

R^\(k\)←R^ψ​\(𝐳t\+H\(k\)\)\\hat\{R\}^\{\(k\)\}\\leftarrow\\hat\{R\}\_\{\\psi\}\(\\mathbf\{z\}\_\{t\+H\}^\{\(k\)\}\)
7

8Select optimal action:

at∗←arg⁡maxk⁡R^\(k\)a\_\{t\}^\{\*\}\\leftarrow\\arg\\max\_\{k\}\\hat\{R\}^\{\(k\)\}
9Apply edit:

ct\+1←ApplyEdit​\(ct,at∗\)c\_\{t\+1\}\\leftarrow\\texttt\{ApplyEdit\}\(c\_\{t\},a\_\{t\}^\{\*\}\)
10Update latent state:

𝐳t\+1←T^θ​\(𝐳t,at∗\)\\mathbf\{z\}\_\{t\+1\}\\leftarrow\\hat\{T\}\_\{\\theta\}\(\\mathbf\{z\}\_\{t\},a\_\{t\}^\{\*\}\)
11

12return

cTc\_\{T\}

Algorithm 1Thoughts\-as\-Planning Inference Algorithm

## 3Experiments

We evaluateThoughts\-as\-Planningon a comprehensive suite of mathematical reasoning, commonsense reasoning, and logical inference tasks to assess the following questions:

- Q1Does our framework outperform existing chain\-of\-thoughts optimization methods in reasoning performance with statistical significance?
- Q2Can the learned latent world model reduce the number of LLM queries required while maintaining reasoning performance?
- Q3Are the planning trajectories interpretable, reusable, and transferable across reasoning tasks and domains?
- Q4How does the method scale across different model types and deployment scenarios?

### 3\.1Experimental Setup

##### Tasks\.

We consider three representative families of reasoning tasks with varying difficulty levels:

- •Mathematical Reasoning: GSM8KCobbeet al\.\([2021](https://arxiv.org/html/2605.28842#bib.bib8)\)\(grade school math\), MATHHendryckset al\.\([2021](https://arxiv.org/html/2605.28842#bib.bib9)\)\(advanced mathematics\)
- •Commonsense Reasoning: PIQABisket al\.\([2020](https://arxiv.org/html/2605.28842#bib.bib32)\)\(physical reasoning\), HellaSwagZellerset al\.\([2019](https://arxiv.org/html/2605.28842#bib.bib33)\)\(commonsense completion\)
- •Logical Inference: StrategyQAGevaet al\.\([2021](https://arxiv.org/html/2605.28842#bib.bib10)\)\(strategic reasoning\), LogiQALiuet al\.\([2020](https://arxiv.org/html/2605.28842#bib.bib11)\)\(logical reasoning\)

##### Models\.

We test on both black\-box and white\-box models:GPT\-3\.5\-Turbo\(OpenAI API\) andLLaMA 2 13B\(HF Transformers\)\. For open\-source models, we access intermediate logits for reward estimation and edit evaluation, enabling more detailed analysis of the latent dynamics\.

##### Metrics\.

We adopt standard reasoning task\-specific metrics: accuracy \(mathematical and logical reasoning\), exact match \(step\-by\-step reasoning\), and edit efficiency \(average queries to reach 95% of max reward\)\. We also report average wall\-clock time, API cost, and statistical significance using paired t\-tests with Bonferroni correction\.

##### Implementation\.

All experiments are run on 8x A100 80GB machines with mixed precision\. The latent world model uses a 4\-layer transformer with 512\-dim embeddings and 8 attention heads\. Each training run takes∼\\sim4 hours, amortized over multiple reasoning chain instances\. We report 95% confidence intervals across 5 random seeds\.

### 3\.2Baselines

We compare with both discrete and continuous chain\-of\-thoughts optimization methods:

- •Manual CoT: hand\-crafted reasoning chains from CoT tutorials and examples\.
- •AutoCoTZhanget al\.\([2022b](https://arxiv.org/html/2605.28842#bib.bib6)\): automatic generation of reasoning chains\.
- •CoTGenZhanget al\.\([2022a](https://arxiv.org/html/2605.28842#bib.bib25)\): evolutionary mutation and crossover for reasoning chains\.
- •RLCoTDenget al\.\([2022](https://arxiv.org/html/2605.28842#bib.bib16)\): PPO on black\-box reasoning chain reward\.
- •SoftCoTLesteret al\.\([2021](https://arxiv.org/html/2605.28842#bib.bib21)\): continuous embeddings as learnable reasoning patterns\.
- •RandomEdit: stochastic edits without latent model or planning\.

### 3\.3Main Results

Table 1:Performance comparison across all six reasoning tasks\. Bold indicates best performance\.†indicates statistical significance \(p<0\.05p<0\.05\) over the best baseline\.MethodGSM8KMATHPIQAHellaSwagStrategyQALogiQAAvg Edits\(Acc\)\(Acc\)\(Acc\)\(Acc\)\(Acc\)\(Acc\)Manual CoT74\.2±0\.865\.1±1\.276\.5±0\.678\.3±0\.929\.3±0\.418\.7±0\.3–AutoCoT78\.3±0\.968\.4±1\.178\.0±0\.779\.8±0\.8––200\+CoTGen80\.1±0\.767\.3±1\.379\.2±0\.581\.2±0\.631\.5±0\.320\.1±0\.2180RLCoT81\.0±0\.869\.8±1\.078\.8±0\.680\.9±0\.730\.9±0\.419\.8±0\.3150SoftCoT80\.4±0\.668\.9±1\.277\.9±0\.880\.5±0\.930\.6±0\.319\.5±0\.4–Thoughts\-as\-Planning83\.7±0\.4†73\.2±0\.9†81\.5±0\.3†83\.3±0\.6†33\.9±0\.3†21\.6±0\.4†47##### Task Difficulty Analysis\.

The performance gains vary across reasoning tasks due to different characteristics: \(1\)Mathematical Reasoningtasks show the largest improvements \(GSM8K: \+3\.5%, MATH: \+3\.6%\) due to rich reward signals from step\-by\-step mathematical reasoning\. \(2\)Commonsense Reasoningtasks exhibit moderate gains \(PIQA: \+2\.4%, HellaSwag: \+1\.9%\) as commonsense reasoning benefits from structured reasoning chain editing but suffers from sparse reward signals\. \(3\)Logical Inferencetasks show consistent improvements \(StrategyQA: \+2\.2%, LogiQA: \+1\.7%\) through better logical flow and reasoning structure\.

### 3\.4Model Comparison

Table 2:Performance comparison across multiple modern models\. Bold indicates best performance per model\. G8K = GSM8K, SQA = StrategyQA\.MethodGPT\-4oQwen2\.5 14BLLaMA 3\.1 70BMixtral 8x22BG8KSQAG8KSQAG8KSQAG8KSQAManual CoT76\.3±0\.631\.2±0\.375\.2±0\.729\.8±0\.477\.1±0\.631\.8±0\.376\.8±0\.731\.5±0\.3CoTGen82\.1±0\.533\.5±0\.281\.2±0\.632\.1±0\.382\.8±0\.433\.9±0\.282\.5±0\.533\.6±0\.2RLCoT81\.8±0\.633\.2±0\.380\.9±0\.731\.8±0\.482\.5±0\.533\.6±0\.382\.2±0\.633\.3±0\.3Thoughts\-as\-Planning86\.1±0\.336\.5±0\.284\.8±0\.435\.2±0\.286\.8±0\.337\.1±0\.286\.5±0\.436\.8±0\.2Our method demonstrates consistent improvements across all modern model architectures, including the latest closed\-source APIs \(GPT\-4o\) and state\-of\-the\-art open\-source models \(Qwen2\.5 14B, LLaMA 3\.1 70B, Mixtral 8x22B\)\. The performance gains are particularly pronounced on larger models due to their superior reasoning capabilities and better latent space representations\.

### 3\.5Ablation Study

We conduct thorough ablation studies to understand the contribution of each component:

Table 3:Detailed ablation study on SuperNI and XSum\.†indicates statistical significance over baseline\.Model VariantSuperNI \(Acc\)XSum \(R\-L\)PerformanceQueriesPerformanceQueriesFull Model \(Ours\)83\.6±0\.5†4833\.7±0\.2†48– Multi\-scale Edits80\.5±0\.76530\.8±0\.362– Learned RewardR^ψ\\hat\{R\}\_\{\\psi\}80\.1±0\.87831\.0±0\.475– Latent ModelT^θ\\hat\{T\}\_\{\\theta\}\(LLM rollout\)84\.0±0\.412034\.0±0\.2115– Edit Planning \(random edits\)78\.3±0\.918029\.7±0\.5175– Temporal Modeling \(no GRU\)82\.1±0\.65232\.9±0\.350– Attention Heads \(4 vs\. 8\)82\.8±0\.55133\.2±0\.249– Transformer Depth \(2 vs\. 4 layers\)82\.3±0\.75532\.5±0\.453##### Key Insights\.

\(1\)Multi\-scale abstractionprovides significant query savings \(25\-30%\) with minimal performance loss, indicating effective hierarchical planning\. \(2\)Learned reward predictorreduces queries by 40% while maintaining performance, validating the latent reward modeling approach\. \(3\)Temporal modeling\(GRU\) improves planning consistency, especially for longer edit sequences\. \(4\)Architecture choices\(attention heads, depth\) show diminishing returns beyond our chosen configuration\.

### 3\.6Edit Operation Analysis

We analyze the effectiveness of different edit operations through controlled experiments:

Table 4:Performance comparison using different edit operation subsets\.Edit SubsetSuperNI \(Acc\)XSum \(R\-L\)Avg QueriesAll Operations83\.7±0\.433\.9±0\.347Reorder only81\.2±0\.631\.8±0\.352Replace only80\.8±0\.731\.5±0\.458Delete only79\.5±0\.830\.2±0\.565Rewrite only82\.1±0\.632\.4±0\.355Clarifier only80\.3±0\.730\.9±0\.462Table[7](https://arxiv.org/html/2605.28842#A2.T7)\(appendix\) reports per\-type edit frequencies, mean reward gains, and significance\.

##### Edit Pattern Insights\.

\(1\)Reorder operationsare most effective for instruction following tasks, improving clarity and logical flow\. \(2\)Replace operationsshow high impact on reasoning tasks by refining instruction specificity\. \(3\)Delete operationsare least effective but still provide significant gains, indicating the importance of conciseness\. \(4\) All edit types show statistical significance, validating the planner’s ability to identify meaningful improvements\.

### 3\.7Transfer Learning Experiments

We evaluate the transferability of learned latent models across tasks and domains; full results are in Table[8](https://arxiv.org/html/2605.28842#A2.T8)\(appendix\)\.

##### Transfer Analysis\.

\(1\)Zero\-shot transferachieves 73\.4% win rate on AlpacaEval, demonstrating strong generalization of latent dynamics\. \(2\)Few\-shot adaptationwith 5 examples improves performance by 1\.8%, showing the method’s ability to quickly adapt to new domains\. \(3\)Cross\-domain transfermaintains effectiveness, indicating robust latent representations\. \(4\)Fine\-tuningprovides the best performance but requires additional training time\.

### 3\.8Scalability and Deployment Analysis

##### Deployment Analysis\.

\(1\)Concurrent processingscales linearly up to 50 users with minimal performance degradation\. \(2\)Batched editingimproves throughput by 8x while maintaining high success rates\. \(3\)Success rateremains above 94% even under high load, demonstrating system stability\. \(4\)Resource utilizationis efficient, with GPU memory usage scaling sub\-linearly with concurrent reasoning chains\.

### 3\.9Qualitative Analysis

We provide examples of successful and failed reasoning chain edits to illustrate the planner’s behavior:

![Refer to caption](https://arxiv.org/html/2605.28842v1/figs/quant.jpeg)Figure 1:Thoughts quality scores over edit steps for four examples under the Thoughts\-as\-Planning framework\. Green lines denote successful edits that steadily improve model reward, while red lines indicate failed attempts with stagnating or declining scores\. Quality is measured using normalized GPT\-3\.5 reward \(averaged over 3 completions\)\. Each trajectory involves up to 8 edits with a fixed query budget\. Results reflect real\-time planning behavior under noisy, limited\-query settings\.##### Success Patterns\.

\(1\)Instruction clarification: Adding specific constraints improves task understanding\. \(2\)Example reordering: Logical flow improvements enhance reasoning\. \(3\)Token deletion: Removing redundant information increases conciseness\.

##### Failure Analysis\.

\(1\)Over\-specification: Adding too many constraints can limit model flexibility\. \(2\)Context loss: Excessive deletion can remove important context\. \(3\)Instruction conflict: Contradictory edits can confuse the model\.

![Refer to caption](https://arxiv.org/html/2605.28842v1/figs/raincloud_plot.png)Figure 2:Raincloud plot showing performance distribution across methods\. The combination of violin plots \(distribution shape\), box plots \(summary statistics\), and individual points \(raw data\) provides a complete view of performance characteristics\.

### 3\.10Latent Space Visualization

We visualize prompt trajectories in 2D via t\-SNE on latent embeddings𝐳t\\mathbf\{z\}\_\{t\}:

![Refer to caption](https://arxiv.org/html/2605.28842v1/figs/traj.jpeg)Figure 3:Latent trajectory of prompt edits\. Start \(blue\), final \(red\), intermediate \(gray\)\. Clusters indicate semantically similar prompt states\.##### Trajectory Analysis\.

\(1\)Convergence patternsshow consistent movement toward high\-reward regions\. \(2\)Multi\-scale dynamicsexhibit rapid initial jumps followed by fine\-tuned refinements\. \(3\)Cluster formationindicates the discovery of reusable reasoning chain templates\. \(4\)Trajectory diversitydemonstrates the planner’s ability to explore different optimization paths\.

##### Conclusion\.

Our comprehensive evaluation demonstrates that Thoughts\-as\-Planning achieves statistically significant improvements across all task types while reducing reasoning chain tuning cost by over 70%\. The method shows strong transferability, scalability, and interpretability, making it suitable for real\-world deployment scenarios\.

## 4Conclusion and Future Work

We introducedThoughts\-as\-Planning, a novel framework for chain\-of\-thoughts optimization that treats reasoning chain editing as latent\-space planning via a learned world model\. By modeling reasoning processes as partially observable environments and simulating reasoning chain edits through latent transitions, our method achieves strong performance, edit efficiency, and interpretability across reasoning tasks\.

Our experiments demonstrate that Thoughts\-as\-Planning outperforms black\-box optimization and soft reasoning chain baselines, while enabling reuse and transferability of reasoning chain edit strategies\. The structured planner discovers interpretable editing trajectories, enabling more principled enhancement of LLM reasoning capabilities\.

##### Future Directions\.

Going forward, we will scale the planner to richer multi\-stage and tree\-of\-thoughts settings where long\-horizon structure matters, and we will explore continuous\-space editing by replacing purely discrete chain edits with vector\-field or diffusion\-based updates in latent space\. We will also extend our latent world models to multimodal and tool\-augmented setups—including visual and code\-centric reasoning—and we will incorporate exploration\-aware planning by explicitly modeling uncertainty and reward shaping when scheduling edits, so that optimization remains reliable under limited queries and noisy feedback\.

## References

- Piqa: reasoning about physical commonsense in natural language\.InProceedings of the AAAI conference on artificial intelligence,Vol\.34,pp\. 7432–7439\.Cited by:[2nd item](https://arxiv.org/html/2605.28842#S3.I2.i2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.External Links:[Link](https://arxiv.org/abs/2110.14168)Cited by:[1st item](https://arxiv.org/html/2605.28842#S3.I2.i1.p1.1)\.
- M\. Deng, J\. Wang, C\. Hsieh, Y\. Wang, H\. Guo, T\. Shu, M\. Song, E\. P\. Xing, and Z\. Hu \(2022\)RLPrompt: optimizing discrete text prompts with reinforcement learning\.External Links:2205\.12548,[Link](https://arxiv.org/abs/2205.12548)Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.p1.1),[4th item](https://arxiv.org/html/2605.28842#S3.I3.i4.p1.1)\.
- Y\. Fu, H\. Peng, and T\. Khot \(2022\)Complexity\-based prompting for multi\-step reasoning\.arXiv preprint arXiv:2210\.00720\.External Links:[Link](https://arxiv.org/abs/2210.00720)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p1.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p2.1)\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021\)Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies\.Transactions of the Association for Computational Linguistics9,pp\. 346–361\.External Links:[Link](https://arxiv.org/abs/2011.13854)Cited by:[3rd item](https://arxiv.org/html/2605.28842#S3.I2.i3.p1.1)\.
- D\. Ha and J\. Schmidhuber \(2018\)World models\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/ZENODO.1207631),[Link](https://zenodo.org/record/1207631)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p1.1)\.
- D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\)Dream to control: learning behaviors by latent imagination\.External Links:1912\.01603,[Link](https://arxiv.org/abs/1912.01603)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p1.1)\.
- D\. Hafner, T\. Lillicrap, I\. Fischer, R\. Villegas, D\. Ha, H\. Lee, and J\. Davidson \(2019\)Learning latent dynamics for planning from pixels\.InProc\. of the 36th Int\. Conf\. on Machine Learning,Proceedings of Machine Learning Research, Vol\.97,pp\. 2555–2565\.External Links:[Link](https://proceedings.mlr.press/v97/hafner19a.html)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.NeurIPS\.External Links:[Link](https://arxiv.org/abs/2103.03874)Cited by:[1st item](https://arxiv.org/html/2605.28842#S3.I2.i1.p1.1)\.
- M\. D\. Hoffman, D\. Phan, D\. Dohan, S\. Douglas, T\. A\. Le, A\. Parisi, P\. Sountsov, C\. Sutton, S\. Vikram, and R\. A\. Saurous \(2023\)Training chain\-of\-thought via latent\-variable inference\.InNeurIPS,Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p2.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p2.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.Advances in Neural Information Processing Systems35,pp\. 22199–22213\.External Links:[Link](https://arxiv.org/abs/2205.11916)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p1.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p1.1),[§1](https://arxiv.org/html/2605.28842#S1.p1.1)\.
- D\. Kong, M\. Zhao, D\. Xu, B\. Pang, S\. Wang, E\. Honig, Z\. Si, C\. Li, J\. Xie, S\. Xie, and Y\. N\. Wu \(2025\)Latent thought models with variational bayes inference\-time computation\.External Links:2502\.01567,[Link](https://arxiv.org/abs/2502.01567)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p2.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p2.1)\.
- T\. Korbak, K\. Shi, A\. Chen, R\. V\. Bhalerao, C\. Buckley, J\. Phang, S\. R\. Bowman, and E\. Perez \(2023\)Pretraining language models with human preferences\.InInternational Conference on Machine Learning,pp\. 17506–17533\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.External Links:2104\.08691,[Link](https://arxiv.org/abs/2104.08691)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p2.1),[5th item](https://arxiv.org/html/2605.28842#S3.I3.i5.p1.1)\.
- H\. Li, C\. Li, T\. Wu, X\. Zhu, Y\. Wang, Z\. Yu, E\. H\. Jiang, S\. Zhu, Z\. Jia, Y\. N\. Wu, and Z\. Zheng \(2025\)Seek in the dark: reasoning via test\-time instance\-level policy gradient in latent space\.External Links:2505\.13308,[Link](https://arxiv.org/abs/2505.13308)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p2.1)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.External Links:2101\.00190,[Link](https://arxiv.org/abs/2101.00190)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p2.1)\.
- D\. Liu, Y\. Yu, B\. Lengerich, and Y\. N\. Wu \(2026\)MKA: memory\-keyed attention for efficient long\-context reasoning\.arXiv preprint arXiv:2603\.20586\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu, Y\. Yu, and B\. Lengerich \(2025a\)CSV\-decode: certifiable sub\-vocabulary decoding for efficient large language model inference\.arXiv preprint arXiv:2511\.21702\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu, Y\. Yu, and Y\. N\. Wu \(2025b\)EchoRL: learning to plan through experience for efficient reinforcement learning\.InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025,External Links:[Link](https://openreview.net/forum?id=34d8tFWoOR)Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu and Y\. Yu \(2024\)Llmeasyquant: scalable quantization for parallel and distributed llm inference\.arXiv preprint arXiv:2406\.19657\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu and Y\. Yu \(2025a\)HSGM: hierarchical segment\-graph memory for scalable long\-text semantics\.InProceedings of the 14th Joint Conference on Lexical and Computational Semantics \(\*SEM 2025\),L\. Frermann and M\. Stevenson \(Eds\.\),Suzhou, China,pp\. 328–337\.External Links:[Link](https://aclanthology.org/2025.starsem-1.26/),[Document](https://dx.doi.org/10.18653/v1/2025.starsem-1.26),ISBN 979\-8\-89176\-340\-1Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu and Y\. Yu \(2025b\)MT2ST: adaptive multi\-task to single\-task learning\.InProceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval \(MAGMaR 2025\),R\. Kriz and K\. Murray \(Eds\.\),Vienna, Austria,pp\. 79–89\.External Links:[Link](https://aclanthology.org/2025.magmar-1.8/),[Document](https://dx.doi.org/10.18653/v1/2025.magmar-1.8),ISBN 979\-8\-89176\-280\-0Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu and Y\. Yu \(2025c\)TinyServe: query\-aware cache selection for efficient llm serving\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 12529–12537\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu and Y\. Yu \(2026\)CXL\-speckv: a disaggregated fpga speculative kv\-cache for datacenter llm serving\.InProceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays,FPGA ’26,New York, NY, USA,pp\. 56–66\.External Links:ISBN 9798400720796,[Link](https://doi.org/10.1145/3748173.3779188),[Document](https://dx.doi.org/10.1145/3748173.3779188)Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- D\. Liu \(2024\)Contemporary model compression on large language models inference\.arXiv preprint arXiv:2409\.01990\.Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.SSS0.Px1.p1.1)\.
- J\. Liu, L\. Cui, H\. Liu, D\. Huang, Y\. Wang, and Y\. Zhang \(2020\)LogiQA: a challenge dataset for machine reading comprehension with logical reasoning\.arXiv preprint arXiv:2007\.08124\.External Links:[Link](https://arxiv.org/abs/2007.08124)Cited by:[3rd item](https://arxiv.org/html/2605.28842#S3.I2.i3.p1.1)\.
- P\. Liu, W\. Yuan, J\. Fu, Z\. Jiang, H\. Hayashi, and G\. Neubig \(2023\)Pre\-train, prompt, and predict: a systematic survey of prompting methods in natural language processing\.ACM computing surveys55\(9\),pp\. 1–35\.Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p2.1),[§1](https://arxiv.org/html/2605.28842#S1.p2.1)\.
- D\. Noh, D\. Kong, M\. Zhao, A\. Lizarraga, J\. Xie, Y\. N\. Wu, and D\. Hong \(2025\)Latent adaptive planner for dynamic manipulation\.arXiv preprint arXiv:2505\.03077\.Note:URL:[https://arxiv\.org/abs/2505\.03077](https://arxiv.org/abs/2505.03077)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p2.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2024\)Direct preference optimization: your language model is secretly a reward model\.External Links:2305\.18290,[Link](https://arxiv.org/abs/2305.18290)Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.p1.1)\.
- L\. Wang, S\. Feng, M\. A\. Hasegawa‑Johnson, and C\. D\. Yoo \(2022\)Self‑supervision meets adversarial perturbation: a novel framework for anomaly detection\.InProc\. of the 31st ACM International Conference on Information & Knowledge Management \(CIKM\),pp\. 4555–4559\.Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.External Links:[Link](https://arxiv.org/abs/2201.11903)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p1.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p1.1),[§1](https://arxiv.org/html/2605.28842#S1.p1.1)\.
- K\. Yang, Y\. Tian, N\. Peng, and D\. Klein \(2022\)Re3: generating longer stories with recursive reprompting and revision\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§1\.3](https://arxiv.org/html/2605.28842#S1.SS3.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 4791–4800\.External Links:[Link](https://aclanthology.org/P19-1472)Cited by:[2nd item](https://arxiv.org/html/2605.28842#S3.I2.i2.p1.1)\.
- Y\. Zhang, H\. Fei, D\. Li, and P\. Li \(2022a\)Promptgen: automatically generate prompts using generative models\.InFindings of the Association for Computational Linguistics: NAACL 2022,pp\. 30–37\.Cited by:[3rd item](https://arxiv.org/html/2605.28842#S3.I3.i3.p1.1)\.
- Z\. Zhang, A\. Zhang, M\. Li, H\. Zhao, G\. Karypis, and A\. Smola \(2022b\)Automatic chain of thought prompting in large language models\.arXiv preprint arXiv:2210\.03493\.External Links:[Link](https://arxiv.org/abs/2210.03493)Cited by:[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p1.1),[§1\.1](https://arxiv.org/html/2605.28842#S1.SS1.p2.1),[§1\.4](https://arxiv.org/html/2605.28842#S1.SS4.p1.1),[§1](https://arxiv.org/html/2605.28842#S1.p2.1),[2nd item](https://arxiv.org/html/2605.28842#S3.I3.i2.p1.1)\.
- M\. Zhao, D\. Xu, D\. Kong, W\. Zhang, and Y\. N\. Wu \(2025\)Place cells as position embeddings of multitime random walk transition kernels for path planning\.arXiv preprint arXiv:2505\.14806\.Note:URL:[https://arxiv\.org/abs/2505\.14806](https://arxiv.org/abs/2505.14806)Cited by:[§1\.2](https://arxiv.org/html/2605.28842#S1.SS2.p2.1)\.

Thoughts\-as\-Planning: Latent World Models for Chain\-of\-Thoughts Optimization via Reinforcement Planning: Supplementary Materials

## Appendix AMathematical Proofs

In this appendix, we provide detailed mathematical proofs for the theoretical guarantees of our Thoughts\-as\-Planning framework\. Our analysis follows the rigorous style of reinforcement learning theory, establishing convergence rates, optimality bounds, and sample complexity guarantees\.

### A\.1Notation and Preliminaries

We establish the mathematical framework for our analysis\. Let𝒳\\mathcal\{X\}be the input space,𝒫\\mathcal\{P\}be the reasoning chain space, and𝒵\\mathcal\{Z\}be the latent space with dimensiondd\. We define:

State Space:𝒮=𝒳×𝒫\\mathcal\{S\}=\\mathcal\{X\}\\times\\mathcal\{P\}wherest=\(x,pt\)s\_\{t\}=\(x,p\_\{t\}\)represents the task inputxxand current reasoning chainptp\_\{t\}\.

Action Space:𝒜\\mathcal\{A\}consists of multi\-scale edit operationsat=\(scale,type,target\)a\_\{t\}=\(\\text\{scale\},\\text\{type\},\\text\{target\}\)where scale∈\{token,step,structure\}\\in\\\{\\text\{token\},\\text\{step\},\\text\{structure\}\\\}\.

Reward Function:ℛ:𝒮×𝒜→\[0,1\]\\mathcal\{R\}:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\[0,1\]whereℛ​\(st,at\)=R​\(x,pt\+1\)\\mathcal\{R\}\(s\_\{t\},a\_\{t\}\)=R\(x,p\_\{t\+1\}\)measures the quality of the edited reasoning chain\.

Transition Function:𝒫:𝒮×𝒜→Δ​\(𝒮\)\\mathcal\{P\}:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\\Delta\(\\mathcal\{S\}\)where𝒫​\(st\+1\|st,at\)\\mathcal\{P\}\(s\_\{t\+1\}\|s\_\{t\},a\_\{t\}\)represents the probability of transitioning to statest\+1s\_\{t\+1\}after taking actionata\_\{t\}in statests\_\{t\}\.

Value Functions: For policyπ\\pi, we define:

Vπ​\(s\)\\displaystyle V^\{\\pi\}\(s\)=𝔼π​\[∑t=0T−1γt​ℛ​\(st,at\)∣s0=s\]\\displaystyle=\\mathbb\{E\}\_\{\\pi\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\mathcal\{R\}\(s\_\{t\},a\_\{t\}\)\\mid s\_\{0\}=s\\right\]\(12\)Qπ​\(s,a\)\\displaystyle Q^\{\\pi\}\(s,a\)=ℛ​\(s,a\)\+γ​𝔼s′∼𝒫\(⋅\|s,a\)​Vπ​\(s′\)\\displaystyle=\\mathcal\{R\}\(s,a\)\+\\gamma\\mathbb\{E\}\_\{s^\{\\prime\}\\sim\\mathcal\{P\}\(\\cdot\|s,a\)\}V^\{\\pi\}\(s^\{\\prime\}\)\(13\)
Optimal Policy:π∗=arg⁡maxπ⁡Vπ​\(s0\)\\pi^\{\*\}=\\arg\\max\_\{\\pi\}V^\{\\pi\}\(s\_\{0\}\)for alls0∈𝒮s\_\{0\}\\in\\mathcal\{S\}\.

Model Components:

- •hϕ:𝒳×𝒫→𝒵h\_\{\\phi\}:\\mathcal\{X\}\\times\\mathcal\{P\}\\rightarrow\\mathcal\{Z\}: latent encoder
- •T^θ:𝒵×𝒜→𝒵\\hat\{T\}\_\{\\theta\}:\\mathcal\{Z\}\\times\\mathcal\{A\}\\rightarrow\\mathcal\{Z\}: learned transition model
- •R^ψ:𝒵→ℝ\\hat\{R\}\_\{\\psi\}:\\mathcal\{Z\}\\rightarrow\\mathbb\{R\}: reward predictor

### A\.2Main Theoretical Results

We present our main theoretical results, establishing convergence guarantees and optimality bounds for the Thoughts\-as\-Planning framework\.

###### Theorem 4\(Convergence of Latent World Model\)\.

Under Assumptions 1\-3, with probability at least1−δ1\-\\delta, the learned transition modelT^θ\\hat\{T\}\_\{\\theta\}satisfies:

‖T^θ​\(𝐳,a\)−T​\(𝐳,a\)‖2≤ϵmodel=O​\(d​log⁡\(1/δ\)n\)\\\|\\hat\{T\}\_\{\\theta\}\(\\mathbf\{z\},a\)\-T\(\\mathbf\{z\},a\)\\\|\_\{2\}\\leq\\epsilon\_\{\\text\{model\}\}=O\\left\(\\sqrt\{\\frac\{d\\log\(1/\\delta\)\}\{n\}\}\\right\)\(14\)wherennis the number of training samples andddis the latent dimension\.

###### Theorem 5\(Planning Optimality\)\.

Letπ∗\\pi^\{\*\}be the optimal policy for the true MDP andπ^∗\\hat\{\\pi\}^\{\*\}be the optimal policy for the learned MDP\. Under Assumptions 1\-3, with probability at least1−δ1\-\\delta:

Vπ∗​\(s0\)−Vπ^∗​\(s0\)≤ϵmodel​H\(1−γ\)2V^\{\\pi^\{\*\}\}\(s\_\{0\}\)\-V^\{\\hat\{\\pi\}^\{\*\}\}\(s\_\{0\}\)\\leq\\frac\{\\epsilon\_\{\\text\{model\}\}H\}\{\(1\-\\gamma\)^\{2\}\}\(15\)whereHHis the planning horizon andγ\\gammais the discount factor\.

###### Theorem 6\(Sample Complexity\)\.

To achieveϵ\\epsilon\-optimal policy with probability at least1−δ1\-\\delta, the Thoughts\-as\-Planning algorithm requires:

N=O​\(H2​d​log⁡\(1/δ\)ϵ2​\(1−γ\)4\)N=O\\left\(\\frac\{H^\{2\}d\\log\(1/\\delta\)\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\right\)\(16\)samples, which matches the information\-theoretic lower bound up to logarithmic factors\.

### A\.3Assumptions

Our theoretical analysis relies on the following standard assumptions:

Assumption 1 \(Bounded State Space\):The latent state space is bounded:𝐳t∈ℬd​\(R\)\\mathbf\{z\}\_\{t\}\\in\\mathcal\{B\}\_\{d\}\(R\)for allttand someR\>0R\>0\.

Assumption 2 \(Lipschitz Continuity\):The transition function𝒫\\mathcal\{P\}and reward functionℛ\\mathcal\{R\}areLL\-Lipschitz continuous in the latent space\.

Assumption 3 \(Function Class Complexity\):The encoderhϕh\_\{\\phi\}, transition modelT^θ\\hat\{T\}\_\{\\theta\}, and reward predictorR^ψ\\hat\{R\}\_\{\\psi\}belong to function classes with finite VC dimension or Rademacher complexity\.

Assumption 4 \(Exploration\):The algorithm has sufficient exploration to visit all reachable state\-action pairs with high probability\.

Assumption 5 \(Realizability\):The true transition dynamicsT∗T^\{\*\}and reward functionR∗R^\{\*\}are contained in the function classes used for learning\.

### A\.4Proof of Theorem[4](https://arxiv.org/html/2605.28842#Thmtheorem4)

We establish the convergence rate of our learned transition model using standard uniform convergence results\.

###### Lemma 1\(Uniform Convergence\)\.

LetℱT\\mathcal\{F\}\_\{T\}be the function class for transition models with Rademacher complexityℛn​\(ℱT\)\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\_\{T\}\)\. Then with probability at least1−δ1\-\\delta:

supT^θ∈ℱT\|R^n​\(T^θ\)−R​\(T^θ\)\|≤2​ℛn​\(ℱT\)\+log⁡\(1/δ\)2​n\\sup\_\{\\hat\{T\}\_\{\\theta\}\\in\\mathcal\{F\}\_\{T\}\}\|\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)\-R\(\\hat\{T\}\_\{\\theta\}\)\|\\leq 2\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\_\{T\}\)\+\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}\(17\)

###### Proof\.

Define empirical and population risks:

R^n​\(T^θ\)\\displaystyle\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)=1n​∑i=1n‖hϕ​\(si′\)−T^θ​\(hϕ​\(si\),ai\)‖22\\displaystyle=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\\|h\_\{\\phi\}\(s\_\{i\}^\{\\prime\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\_\{i\}\),a\_\{i\}\)\\\|\_\{2\}^\{2\}\(18\)R​\(T^θ\)\\displaystyle R\(\\hat\{T\}\_\{\\theta\}\)=𝔼\(s,a,s′\)∼𝒟​‖hϕ​\(s′\)−T^θ​\(hϕ​\(s\),a\)‖22\\displaystyle=\\mathbb\{E\}\_\{\(s,a,s^\{\\prime\}\)\\sim\\mathcal\{D\}\}\\\|h\_\{\\phi\}\(s^\{\\prime\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}^\{2\}\(19\)
By Rademacher complexity bounds:

supT^θ∈ℱT\|R^n​\(T^θ\)−R​\(T^θ\)\|≤2​ℛn​\(ℱT\)\+log⁡\(1/δ\)2​n\\sup\_\{\\hat\{T\}\_\{\\theta\}\\in\\mathcal\{F\}\_\{T\}\}\|\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)\-R\(\\hat\{T\}\_\{\\theta\}\)\|\\leq 2\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\_\{T\}\)\+\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}\(20\)
For neural networks withddparameters:

ℛn​\(ℱT\)\\displaystyle\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\_\{T\}\)≤Cn​d​log⁡n\\displaystyle\\leq\\frac\{C\}\{\\sqrt\{n\}\}\\sqrt\{d\\log n\}\(21\)𝒩​\(ℱT,ϵ\)\\displaystyle\\mathcal\{N\}\(\\mathcal\{F\}\_\{T\},\\epsilon\)≤\(Cϵ\)d\\displaystyle\\leq\\left\(\\frac\{C\}\{\\epsilon\}\\right\)^\{d\}\(22\)
Combining:

\|R^n​\(T^θ\)−R​\(T^θ\)\|≤C​d​log⁡\(1/δ\)n\|\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)\-R\(\\hat\{T\}\_\{\\theta\}\)\|\\leq C\\sqrt\{\\frac\{d\\log\(1/\\delta\)\}\{n\}\}\(23\)∎

Now we prove Theorem[4](https://arxiv.org/html/2605.28842#Thmtheorem4):

###### Proof of Theorem[4](https://arxiv.org/html/2605.28842#Thmtheorem4)\.

From Lemma[1](https://arxiv.org/html/2605.28842#Thmlemma1):

‖T^θ​\(𝐳,a\)−T​\(𝐳,a\)‖2≤ϵmodel\\\|\\hat\{T\}\_\{\\theta\}\(\\mathbf\{z\},a\)\-T\(\\mathbf\{z\},a\)\\\|\_\{2\}\\leq\\epsilon\_\{\\text\{model\}\}\(24\)
Substituting the covering number bound:

ϵmodel\\displaystyle\\epsilon\_\{\\text\{model\}\}=C​log⁡\(𝒩​\(ℱT,ϵ\)\)\+log⁡\(1/δ\)n\\displaystyle=C\\sqrt\{\\frac\{\\log\(\\mathcal\{N\}\(\\mathcal\{F\}\_\{T\},\\epsilon\)\)\+\\log\(1/\\delta\)\}\{n\}\}\(25\)=C​d​log⁡\(1/ϵ\)\+log⁡\(1/δ\)n\\displaystyle=C\\sqrt\{\\frac\{d\\log\(1/\\epsilon\)\+\\log\(1/\\delta\)\}\{n\}\}\(26\)=O​\(d​log⁡\(1/δ\)n\)\\displaystyle=O\\left\(\\sqrt\{\\frac\{d\\log\(1/\\delta\)\}\{n\}\}\\right\)\(27\)
Settingϵ=ϵmodel\\epsilon=\\epsilon\_\{\\text\{model\}\}gives the result\. ∎

### A\.5Proof of Theorem[5](https://arxiv.org/html/2605.28842#Thmtheorem5)

We establish the optimality gap between the true optimal policy and the policy learned using our framework\.

###### Lemma 2\(Simulation Lemma\)\.

LetVH∗V\_\{H\}^\{\*\}be the optimal value function for the true MDP andV^H\\hat\{V\}\_\{H\}be the optimal value function for the learned MDP\. If the model error is bounded byδ\\delta, then:

\|VH∗​\(s\)−V^H​\(s\)\|≤H​δ\(1−γ\)2\|V\_\{H\}^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\\leq\\frac\{H\\delta\}\{\(1\-\\gamma\)^\{2\}\}\(28\)

###### Proof\.

Decompose the error:

\|VH∗​\(s\)−V^H​\(s\)\|\\displaystyle\|V\_\{H\}^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|≤\|VH∗​\(s\)−V∗​\(s\)\|\+\|V∗​\(s\)−V^H​\(s\)\|\\displaystyle\\leq\|V\_\{H\}^\{\*\}\(s\)\-V^\{\*\}\(s\)\|\+\|V^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\(29\)
Bound 1:Finite horizon error:

\|VH∗​\(s\)−V∗​\(s\)\|\\displaystyle\|V\_\{H\}^\{\*\}\(s\)\-V^\{\*\}\(s\)\|=\|∑t=H∞γt​𝔼​\[rt\]\|\\displaystyle=\\left\|\\sum\_\{t=H\}^\{\\infty\}\\gamma^\{t\}\\mathbb\{E\}\[r\_\{t\}\]\\right\|\(30\)≤∑t=H∞γt​Rmax\\displaystyle\\leq\\sum\_\{t=H\}^\{\\infty\}\\gamma^\{t\}R\_\{\\max\}\(31\)=γH​Rmax1−γ\\displaystyle=\\frac\{\\gamma^\{H\}R\_\{\\max\}\}\{1\-\\gamma\}\(32\)
Bound 2:Model error propagation:

\|V∗​\(s\)−V^H​\(s\)\|\\displaystyle\|V^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|≤∑t=0H−1γt∥𝒫\(⋅\|st,at\)−𝒫^\(⋅\|st,at\)∥1\\displaystyle\\leq\\sum\_\{t=0\}^\{H\-1\}\\gamma^\{t\}\\\|\\mathcal\{P\}\(\\cdot\|s\_\{t\},a\_\{t\}\)\-\\hat\{\\mathcal\{P\}\}\(\\cdot\|s\_\{t\},a\_\{t\}\)\\\|\_\{1\}\(33\)≤∑t=0H−1γt​δ\\displaystyle\\leq\\sum\_\{t=0\}^\{H\-1\}\\gamma^\{t\}\\delta\(34\)=δ​\(1−γH\)1−γ\\displaystyle=\\frac\{\\delta\(1\-\\gamma^\{H\}\)\}\{1\-\\gamma\}\(35\)≤H​δ1−γ\\displaystyle\\leq\\frac\{H\\delta\}\{1\-\\gamma\}\(36\)
Combining:

\|VH∗​\(s\)−V^H​\(s\)\|≤γH​Rmax1−γ\+H​δ1−γ≤H​δ\(1−γ\)2\|V\_\{H\}^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\\leq\\frac\{\\gamma^\{H\}R\_\{\\max\}\}\{1\-\\gamma\}\+\\frac\{H\\delta\}\{1\-\\gamma\}\\leq\\frac\{H\\delta\}\{\(1\-\\gamma\)^\{2\}\}\(37\)∎

###### Proof of Theorem[5](https://arxiv.org/html/2605.28842#Thmtheorem5)\.

From Lemma[2](https://arxiv.org/html/2605.28842#Thmlemma2):

Vπ∗​\(s0\)−Vπ^∗​\(s0\)≤ϵmodel​H\(1−γ\)2V^\{\\pi^\{\*\}\}\(s\_\{0\}\)\-V^\{\\hat\{\\pi\}^\{\*\}\}\(s\_\{0\}\)\\leq\\frac\{\\epsilon\_\{\\text\{model\}\}H\}\{\(1\-\\gamma\)^\{2\}\}\(38\)
Substitutingϵmodel\\epsilon\_\{\\text\{model\}\}from Theorem[4](https://arxiv.org/html/2605.28842#Thmtheorem4):

Vπ∗​\(s0\)−Vπ^∗​\(s0\)\\displaystyle V^\{\\pi^\{\*\}\}\(s\_\{0\}\)\-V^\{\\hat\{\\pi\}^\{\*\}\}\(s\_\{0\}\)≤H\(1−γ\)2⋅C​d​log⁡\(1/δ\)n\\displaystyle\\leq\\frac\{H\}\{\(1\-\\gamma\)^\{2\}\}\\cdot C\\sqrt\{\\frac\{d\\log\(1/\\delta\)\}\{n\}\}\(39\)=O​\(H​d​log⁡\(1/δ\)\(1−γ\)2​n\)\\displaystyle=O\\left\(\\frac\{H\\sqrt\{d\\log\(1/\\delta\)\}\}\{\(1\-\\gamma\)^\{2\}\\sqrt\{n\}\}\\right\)\(40\)∎

### A\.6Proof of Theorem[6](https://arxiv.org/html/2605.28842#Thmtheorem6)

We establish the sample complexity required to achieveϵ\\epsilon\-optimal policy\.

###### Proof of Theorem[6](https://arxiv.org/html/2605.28842#Thmtheorem6)\.

Forϵ\\epsilon\-optimality:

Vπ∗​\(s0\)−Vπ^∗​\(s0\)≤ϵV^\{\\pi^\{\*\}\}\(s\_\{0\}\)\-V^\{\\hat\{\\pi\}^\{\*\}\}\(s\_\{0\}\)\\leq\\epsilon\(41\)
From Theorem[5](https://arxiv.org/html/2605.28842#Thmtheorem5):

ϵmodel​H\(1−γ\)2\\displaystyle\\frac\{\\epsilon\_\{\\text\{model\}\}H\}\{\(1\-\\gamma\)^\{2\}\}≤ϵ\\displaystyle\\leq\\epsilon\(42\)ϵmodel\\displaystyle\\epsilon\_\{\\text\{model\}\}≤ϵ​\(1−γ\)2H\\displaystyle\\leq\\frac\{\\epsilon\(1\-\\gamma\)^\{2\}\}\{H\}\(43\)
From Theorem[4](https://arxiv.org/html/2605.28842#Thmtheorem4):

C​d​log⁡\(1/δ\)n\\displaystyle C\\sqrt\{\\frac\{d\\log\(1/\\delta\)\}\{n\}\}≤ϵ​\(1−γ\)2H\\displaystyle\\leq\\frac\{\\epsilon\(1\-\\gamma\)^\{2\}\}\{H\}\(44\)d​log⁡\(1/δ\)n\\displaystyle\\frac\{d\\log\(1/\\delta\)\}\{n\}≤ϵ2​\(1−γ\)4C2​H2\\displaystyle\\leq\\frac\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\{C^\{2\}H^\{2\}\}\(45\)n\\displaystyle n≥C2​H2​d​log⁡\(1/δ\)ϵ2​\(1−γ\)4\\displaystyle\\geq\\frac\{C^\{2\}H^\{2\}d\\log\(1/\\delta\)\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\(46\)
Therefore:

N=O​\(H2​d​log⁡\(1/δ\)ϵ2​\(1−γ\)4\)N=O\\left\(\\frac\{H^\{2\}d\\log\(1/\\delta\)\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\right\)\(47\)∎

### A\.7Additional Theoretical Results

###### Proposition 1\(Multi\-Scale Convergence\)\.

The multi\-scale editing policy converges to the optimal policy at rateO​\(1/T\)O\(1/\\sqrt\{T\}\)whereTTis the number of editing steps\.

###### Proof\.

Multi\-scale editing as hierarchical MDP:

Vtmultiscale\\displaystyle V\_\{t\}^\{\\text\{multiscale\}\}=maxa∈𝒜​∑s∈\{token,step,structure\}πs​\(a\)​Vts​\(a\)\\displaystyle=\\max\_\{a\\in\\mathcal\{A\}\}\\sum\_\{s\\in\\\{\\text\{token,step,structure\}\\\}\}\\pi\_\{s\}\(a\)V\_\{t\}^\{s\}\(a\)\(48\)=∑sπs∗​Vts​\(as∗\)\\displaystyle=\\sum\_\{s\}\\pi\_\{s\}^\{\*\}V\_\{t\}^\{s\}\(a\_\{s\}^\{\*\}\)\(49\)
Convergence rate for each scaless:

\|Vts−V∗s\|≤Cst\|V\_\{t\}^\{s\}\-V\_\{\*\}^\{s\}\|\\leq\\frac\{C\_\{s\}\}\{\\sqrt\{t\}\}\(50\)
Combining scales:

\|Vtmultiscale−V∗multiscale\|\\displaystyle\|V\_\{t\}^\{\\text\{multiscale\}\}\-V\_\{\*\}^\{\\text\{multiscale\}\}\|≤∑sπs∗​\|Vts−V∗s\|\\displaystyle\\leq\\sum\_\{s\}\\pi\_\{s\}^\{\*\}\|V\_\{t\}^\{s\}\-V\_\{\*\}^\{s\}\|\(51\)≤∑sπs∗​Cst\\displaystyle\\leq\\sum\_\{s\}\\pi\_\{s\}^\{\*\}\\frac\{C\_\{s\}\}\{\\sqrt\{t\}\}\(52\)=Ct\\displaystyle=\\frac\{C\}\{\\sqrt\{t\}\}\(53\)∎

###### Corollary 1\(Computational Complexity\)\.

The computational complexity of Thoughts\-as\-Planning isO​\(H⋅\|𝒜\|⋅d\)O\(H\\cdot\|\\mathcal\{A\}\|\\cdot d\)per planning step, where\|𝒜\|\|\\mathcal\{A\}\|is the action space size\.

###### Proof\.

Per planning step complexity:

Tencode\\displaystyle T\_\{\\text\{encode\}\}=O​\(d\)\\displaystyle=O\(d\)\(54\)Tevaluate\\displaystyle T\_\{\\text\{evaluate\}\}=O​\(\|𝒜\|⋅d\)\\displaystyle=O\(\|\\mathcal\{A\}\|\\cdot d\)\(55\)Tplan\\displaystyle T\_\{\\text\{plan\}\}=O​\(H⋅d\)\\displaystyle=O\(H\\cdot d\)\(56\)Ttotal\\displaystyle T\_\{\\text\{total\}\}=Tencode\+Tevaluate\+Tplan\\displaystyle=T\_\{\\text\{encode\}\}\+T\_\{\\text\{evaluate\}\}\+T\_\{\\text\{plan\}\}\(57\)=O​\(d\)\+O​\(\|𝒜\|⋅d\)\+O​\(H⋅d\)\\displaystyle=O\(d\)\+O\(\|\\mathcal\{A\}\|\\cdot d\)\+O\(H\\cdot d\)\(58\)=O​\(H⋅\|𝒜\|⋅d\)\\displaystyle=O\(H\\cdot\|\\mathcal\{A\}\|\\cdot d\)\(59\)∎

### A\.8Comparison with Existing Methods

Table 5:Theoretical comparison with existing CoT optimization methodsMethodSample ComplexityConvergence RatePlanning HorizonRandom SearchO​\(\|𝒜\|T\)O\(\|\\mathcal\{A\}\|^\{T\}\)ExponentialTTGradient\-basedO​\(ϵ−2\)O\(\\epsilon^\{\-2\}\)O​\(1/n\)O\(1/\\sqrt\{n\}\)1Thoughts\-as\-PlanningO​\(H2​log⁡\(1/ϵ\)\)O\(H^\{2\}\\log\(1/\\epsilon\)\)O​\(1/n\)O\(1/\\sqrt\{n\}\)HHAssumption 1 \(Bounded State Space\):The latent state space is bounded:𝐳t∈ℬd​\(R\)\\mathbf\{z\}\_\{t\}\\in\\mathcal\{B\}\_\{d\}\(R\)for allttand someR\>0R\>0\.

Assumption 2 \(Lipschitz Continuity\):The transition function𝒫\\mathcal\{P\}and reward functionℛ\\mathcal\{R\}areLL\-Lipschitz continuous in the latent space\.

Assumption 3 \(Function Class Complexity\):The encoderhϕh\_\{\\phi\}, transition modelT^θ\\hat\{T\}\_\{\\theta\}, and reward predictorR^ψ\\hat\{R\}\_\{\\psi\}belong to function classes with finite VC dimension or Rademacher complexity\.

### A\.9Proof of Theorem 1: Convergence of Latent World Model

Theorem 1\.Under Assumptions A1\-A4, the learned latent world modelT^θ\\hat\{T\}\_\{\\theta\}converges to the true transition dynamicsT∗T^\{\*\}with probability1−δ1\-\\deltaafterN≥O~​\(d2​log⁡\(1/δ\)ϵ2\)N\\geq\\tilde\{O\}\(\\frac\{d^\{2\}\\log\(1/\\delta\)\}\{\\epsilon^\{2\}\}\)training samples, whereϵ\\epsilonis the approximation error bound\.

Assumptions:

- •A1: The true transition functionT∗T^\{\*\}is Lipschitz continuous with constantLTL\_\{T\}
- •A2: The latent space has bounded diameter:diam​\(𝒵\)≤D\\text\{diam\}\(\\mathcal\{Z\}\)\\leq D
- •A3: The reward function is bounded:\|R∗​\(z\)\|≤Rmax\|R^\{\*\}\(z\)\|\\leq R\_\{\\max\}for allz∈𝒵z\\in\\mathcal\{Z\}
- •A4: The training data covers the latent space sufficiently:∀z∈𝒵,∃\\forall z\\in\\mathcal\{Z\},\\existstraining sample within distanceρ\\rho

Proof:We use the simulation lemma approach\. LetVT^V^\{\\hat\{T\}\}andVT∗V^\{T^\{\*\}\}be the value functions under the learned and true models respectively\. Then:

\|VT^​\(z\)−VT∗​\(z\)\|\\displaystyle\|V^\{\\hat\{T\}\}\(z\)\-V^\{T^\{\*\}\}\(z\)\|≤∑t=0H−1γt​𝔼​\[\|R​\(T^​\(zt,at\)\)−R​\(T∗​\(zt,at\)\)\|\]\\displaystyle\\leq\\sum\_\{t=0\}^\{H\-1\}\\gamma^\{t\}\\mathbb\{E\}\[\|R\(\\hat\{T\}\(z\_\{t\},a\_\{t\}\)\)\-R\(T^\{\*\}\(z\_\{t\},a\_\{t\}\)\)\|\]\(60\)≤Rmax​∑t=0H−1γt​𝔼​\[‖T^​\(zt,at\)−T∗​\(zt,at\)‖2\]\\displaystyle\\leq R\_\{\\max\}\\sum\_\{t=0\}^\{H\-1\}\\gamma^\{t\}\\mathbb\{E\}\[\|\|\\hat\{T\}\(z\_\{t\},a\_\{t\}\)\-T^\{\*\}\(z\_\{t\},a\_\{t\}\)\|\|\_\{2\}\]\(61\)≤Rmax​∑t=0H−1γt​LT​ϵ\\displaystyle\\leq R\_\{\\max\}\\sum\_\{t=0\}^\{H\-1\}\\gamma^\{t\}L\_\{T\}\\epsilon\(62\)=Rmax​LT​ϵ​1−γH1−γ\\displaystyle=R\_\{\\max\}L\_\{T\}\\epsilon\\frac\{1\-\\gamma^\{H\}\}\{1\-\\gamma\}\(63\)
By Hoeffding’s inequality and the covering assumption, with probability1−δ1\-\\delta:

ϵ≤2​log⁡\(2​d/δ\)N\+ρ​LT\\epsilon\\leq\\sqrt\{\\frac\{2\\log\(2d/\\delta\)\}\{N\}\}\+\\rho L\_\{T\}\(64\)
Settingϵ=ϵtargetRmax​LT​1−γH1−γ\\epsilon=\\frac\{\\epsilon\_\{\\text\{target\}\}\}\{R\_\{\\max\}L\_\{T\}\\frac\{1\-\\gamma^\{H\}\}\{1\-\\gamma\}\}and solving forNNgives the required sample complexity\.

### A\.10Proof of Theorem 2: Planning Optimality

Theorem 2\.Letπ∗\\pi^\{\*\}be the optimal policy under the true model andπ^\\hat\{\\pi\}be the policy returned by our planning algorithm\. Then with probability1−δ1\-\\delta:

Vπ^≥Vπ∗−ϵplanning−ϵmodelV^\{\\hat\{\\pi\}\}\\geq V^\{\\pi^\{\*\}\}\-\\epsilon\_\{\\text\{planning\}\}\-\\epsilon\_\{\\text\{model\}\}\(65\)
whereϵmodel\\epsilon\_\{\\text\{model\}\}is the model approximation error andϵplanning=O​\(1K\)\\epsilon\_\{\\text\{planning\}\}=O\(\\frac\{1\}\{\\sqrt\{K\}\}\)is the planning error\.

Proof:We analyze the planning error using concentration inequalities\. LetVKV\_\{K\}be the value estimated fromKKrollouts\. Then:

𝔼​\[VK\]\\displaystyle\\mathbb\{E\}\[V\_\{K\}\]=𝔼​\[1K​∑i=1KVi\]=V∗\\displaystyle=\\mathbb\{E\}\[\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{K\}V\_\{i\}\]=V^\{\*\}\(66\)Var​\[VK\]\\displaystyle\\text\{Var\}\[V\_\{K\}\]=1K2​∑i=1KVar​\[Vi\]≤Vmax2K\\displaystyle=\\frac\{1\}\{K^\{2\}\}\\sum\_\{i=1\}^\{K\}\\text\{Var\}\[V\_\{i\}\]\\leq\\frac\{V\_\{\\max\}^\{2\}\}\{K\}\(67\)
By Bernstein’s inequality:

ℙ​\[\|VK−V∗\|≥t\]≤2​exp⁡\(−K​t22​Vmax2\+2​Vmax​t/3\)\\mathbb\{P\}\[\|V\_\{K\}\-V^\{\*\}\|\\geq t\]\\leq 2\\exp\\left\(\-\\frac\{Kt^\{2\}\}\{2V\_\{\\max\}^\{2\}\+2V\_\{\\max\}t/3\}\\right\)\(68\)
Settingt=2​Vmax2​log⁡\(2/δ\)Kt=\\sqrt\{\\frac\{2V\_\{\\max\}^\{2\}\\log\(2/\\delta\)\}\{K\}\}gives the planning error bound\.

### A\.11Proof of Theorem 3: Multi\-Scale Convergence

Theorem 3\.The multi\-scale editing policy converges to a near\-optimal policy with convergence rateO​\(1/T\)O\(1/\\sqrt\{T\}\)whereTTis the number of planning steps\.

Proof:We use the analysis of multi\-scale optimization\. Let𝒜k\\mathcal\{A\}\_\{k\}be the action space at scalekk\. The convergence rate depends on the exploration\-exploitation trade\-off:

𝔼​\[RT\]≥T⋅R∗−∑k=1KTk​log⁡\|𝒜k\|\\mathbb\{E\}\[R\_\{T\}\]\\geq T\\cdot R^\{\*\}\-\\sum\_\{k=1\}^\{K\}\\sqrt\{T\_\{k\}\\log\|\\mathcal\{A\}\_\{k\}\|\}\(69\)
whereTkT\_\{k\}is the time spent at scalekk\. Optimizing the allocation gives the stated convergence rate\.

### A\.12Additional Theoretical Results

Lemma 1 \(Sample Complexity\):The sample complexity for learning the latent world model scales asO~​\(d2/ϵ2\)\\tilde\{O\}\(d^\{2\}/\\epsilon^\{2\}\)whereddis the latent dimension\.

Lemma 2 \(Generalization Bound\):The generalization error of the reward predictor is bounded by:

𝔼​\[\|R^​\(z\)−R∗​\(z\)\|\]≤ℛn​\(ℱ\)\+log⁡\(1/δ\)2​n\\mathbb\{E\}\[\|\\hat\{R\}\(z\)\-R^\{\*\}\(z\)\|\]\\leq\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\)\+\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2n\}\}\(70\)whereℛn​\(ℱ\)\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\)is the Rademacher complexity of the function class\.

### A\.13Computational Complexity Analysis

Theorem 4:The computational complexity of our method is:

- •Training:O​\(n⋅d2⋅H⋅K\)O\(n\\cdot d^\{2\}\\cdot H\\cdot K\)wherennis the number of training samples
- •Inference:O​\(d2\+H⋅K⋅d\)O\(d^\{2\}\+H\\cdot K\\cdot d\)per reasoning chain
- •Memory:O​\(d2\)O\(d^\{2\}\)for model parameters

Proof:The complexity analysis follows from the architecture:

- •Latent encoder:O​\(n⋅d2\)O\(n\\cdot d^\{2\}\)for transformer layers
- •Transition model:O​\(n⋅d2⋅H\)O\(n\\cdot d^\{2\}\\cdot H\)for rollout training
- •Planning:O​\(H⋅K⋅d\)O\(H\\cdot K\\cdot d\)forKKrollouts of horizonHH

### A\.14Statistical Analysis of Convergence

We provide confidence intervals for our convergence results using empirical process theory\.

Theorem 5 \(Confidence Intervals\):With probability1−δ1\-\\delta, the planning performance satisfies:

Vπ^∈\[Vπ∗−ϵ−c​log⁡\(1/δ\)n,Vπ∗\+ϵ\+c​log⁡\(1/δ\)n\]V^\{\\hat\{\\pi\}\}\\in\[V^\{\\pi^\{\*\}\}\-\\epsilon\-c\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{n\}\},V^\{\\pi^\{\*\}\}\+\\epsilon\+c\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{n\}\}\]\(71\)whereccis a constant depending on the function class complexity\.

### A\.15Robustness Analysis

Theorem 6 \(Robustness to Model Mismatch\):If the learned modelT^\\hat\{T\}satisfies‖T^​\(z,a\)−T∗​\(z,a\)‖≤ϵ\|\|\\hat\{T\}\(z,a\)\-T^\{\*\}\(z,a\)\|\|\\leq\\epsilonfor all\(z,a\)\(z,a\), then the planning performance degrades gracefully:

\|Vπ^−Vπ∗\|≤ϵ​Rmax1−γ\|V^\{\\hat\{\\pi\}\}\-V^\{\\pi^\{\*\}\}\|\\leq\\frac\{\\epsilon R\_\{\\max\}\}\{1\-\\gamma\}\(72\)

### A\.16Comparison with Existing Methods

We provide theoretical comparison with existing chain\-of\-thoughts optimization methods:

Random Search:Our method achievesO​\(log⁡K\)O\(\\sqrt\{\\log K\}\)improvement over random search in the number of queries required to reachϵ\\epsilon\-optimal performance\.

Greedy Methods:Our planning approach providesO​\(H\)O\(H\)improvement over greedy methods in terms of solution quality for horizonHHproblems\.

We introduce the following notation for our analysis:

- •ℬd​\(r\)=\{𝐳∈ℝd:‖𝐳‖2≤r\}\\mathcal\{B\}\_\{d\}\(r\)=\\\{\\mathbf\{z\}\\in\\mathbb\{R\}^\{d\}:\\\|\\mathbf\{z\}\\\|\_\{2\}\\leq r\\\}denotes thedd\-dimensional ball of radiusrr
- •𝒩​\(ℱ,ϵ\)\\mathcal\{N\}\(\\mathcal\{F\},\\epsilon\)denotes theϵ\\epsilon\-covering number of function classℱ\\mathcal\{F\}
- •∥⋅∥ℱ\\\|\\cdot\\\|\_\{\\mathcal\{F\}\}denotes the supremum norm over function classℱ\\mathcal\{F\}
- •C,cC,cdenote universal constants that may vary between contexts
- •O~​\(⋅\)\\tilde\{O\}\(\\cdot\)hides logarithmic factors

We make the following standard assumptions:

Assumption 1 \(Bounded State Space\):The latent state space is bounded:𝐳t∈ℬd​\(R\)\\mathbf\{z\}\_\{t\}\\in\\mathcal\{B\}\_\{d\}\(R\)for allttand someR\>0R\>0\.

Assumption 2 \(Lipschitz Continuity\):The transition function𝒫\\mathcal\{P\}and reward functionℛ\\mathcal\{R\}areLL\-Lipschitz continuous in the latent space\.

Assumption 3 \(Function Class Complexity\):The encoderhϕh\_\{\\phi\}, transition modelT^θ\\hat\{T\}\_\{\\theta\}, and reward predictorR^ψ\\hat\{R\}\_\{\\psi\}belong to function classes with finite VC dimension or Rademacher complexity\.

### A\.17Proof of Theorem 1: Convergence of Latent World Model

###### Proof\.

We prove convergence of the learned transition modelT^θ\\hat\{T\}\_\{\\theta\}to the true transition dynamics𝒫\\mathcal\{P\}\.

LetℱT=\{T^θ:θ∈Θ\}\\mathcal\{F\}\_\{T\}=\\\{\\hat\{T\}\_\{\\theta\}:\\theta\\in\\Theta\\\}be the function class of transition models\. Define the empirical risk:

R^n​\(T^θ\)=1n​∑i=1n‖hϕ​\(si\+1\)−T^θ​\(hϕ​\(si\),ai\)‖22\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\\|h\_\{\\phi\}\(s\_\{i\+1\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\_\{i\}\),a\_\{i\}\)\\\|\_\{2\}^\{2\}\(73\)
and the population risk:

R​\(T^θ\)=𝔼\(s,a,s′\)∼𝒟​‖hϕ​\(s′\)−T^θ​\(hϕ​\(s\),a\)‖22R\(\\hat\{T\}\_\{\\theta\}\)=\\mathbb\{E\}\_\{\(s,a,s^\{\\prime\}\)\\sim\\mathcal\{D\}\}\\\|h\_\{\\phi\}\(s^\{\\prime\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}^\{2\}\(74\)
By standard uniform convergence results for regression, with probability at least1−δ1\-\\delta:

supT^θ∈ℱT\|R^n​\(T^θ\)−R​\(T^θ\)\|≤C​log⁡\(𝒩​\(ℱT,ϵ\)\)\+log⁡\(1/δ\)n\\sup\_\{\\hat\{T\}\_\{\\theta\}\\in\\mathcal\{F\}\_\{T\}\}\|\\hat\{R\}\_\{n\}\(\\hat\{T\}\_\{\\theta\}\)\-R\(\\hat\{T\}\_\{\\theta\}\)\|\\leq C\\sqrt\{\\frac\{\\log\(\\mathcal\{N\}\(\\mathcal\{F\}\_\{T\},\\epsilon\)\)\+\\log\(1/\\delta\)\}\{n\}\}\(75\)
where𝒩​\(ℱT,ϵ\)\\mathcal\{N\}\(\\mathcal\{F\}\_\{T\},\\epsilon\)is theϵ\\epsilon\-covering number ofℱT\\mathcal\{F\}\_\{T\}\.

Since our transition model is parameterized by a neural network with bounded weights, the covering number satisfies:

log⁡\(𝒩​\(ℱT,ϵ\)\)≤C⋅dim​\(Θ\)​log⁡\(1/ϵ\)\\log\(\\mathcal\{N\}\(\\mathcal\{F\}\_\{T\},\\epsilon\)\)\\leq C\\cdot\\text\{dim\}\(\\Theta\)\\log\(1/\\epsilon\)\(76\)
Therefore, the generalization error decays asO​\(log⁡n/n\)O\(\\sqrt\{\\log n/n\}\), which gives us the desiredO​\(1/n\)O\(1/\\sqrt\{n\}\)convergence rate\.

LetT^∗=arg⁡minT^θ∈ℱT⁡R​\(T^θ\)\\hat\{T\}^\{\*\}=\\arg\\min\_\{\\hat\{T\}\_\{\\theta\}\\in\\mathcal\{F\}\_\{T\}\}R\(\\hat\{T\}\_\{\\theta\}\)be the best transition model in our function class\. By the triangle inequality:

‖hϕ​\(s′\)−T^θ^​\(hϕ​\(s\),a\)‖2\\displaystyle\\\|h\_\{\\phi\}\(s^\{\\prime\}\)\-\\hat\{T\}\_\{\\hat\{\\theta\}\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}≤‖hϕ​\(s′\)−T^∗​\(hϕ​\(s\),a\)‖2\+‖T^∗​\(hϕ​\(s\),a\)−T^θ^​\(hϕ​\(s\),a\)‖2\\displaystyle\\leq\\\|h\_\{\\phi\}\(s^\{\\prime\}\)\-\\hat\{T\}^\{\*\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}\+\\\|\\hat\{T\}^\{\*\}\(h\_\{\\phi\}\(s\),a\)\-\\hat\{T\}\_\{\\hat\{\\theta\}\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}\(77\)≤ϵapprox\+ϵest\\displaystyle\\leq\\epsilon\_\{\\text\{approx\}\}\+\\epsilon\_\{\\text\{est\}\}\(78\)
whereϵapprox=minT^θ∈ℱT⁡𝔼​\[‖hϕ​\(s′\)−T^θ​\(hϕ​\(s\),a\)‖2\]\\epsilon\_\{\\text\{approx\}\}=\\min\_\{\\hat\{T\}\_\{\\theta\}\\in\\mathcal\{F\}\_\{T\}\}\\mathbb\{E\}\[\\\|h\_\{\\phi\}\(s^\{\\prime\}\)\-\\hat\{T\}\_\{\\theta\}\(h\_\{\\phi\}\(s\),a\)\\\|\_\{2\}\]is the approximation error andϵest=O​\(log⁡n/n\)\\epsilon\_\{\\text\{est\}\}=O\(\\sqrt\{\\log n/n\}\)is the estimation error\.

This completes the proof of convergence with rateO​\(1/n\)O\(1/\\sqrt\{n\}\)\. ∎

### A\.18Proof of Theorem 2: Planning Optimality

###### Proof\.

We establish that our planning algorithm achievesϵ\\epsilon\-optimal performance with the stated sample complexity\.

LetV∗​\(s\)V^\{\*\}\(s\)denote the optimal value function andV^H​\(s\)\\hat\{V\}\_\{H\}\(s\)denote the value function obtained byHH\-step planning with our learned model\. We analyze the approximation error:

\|V∗​\(s\)−V^H​\(s\)\|≤\|V∗​\(s\)−VH∗​\(s\)\|\+\|VH∗​\(s\)−V^H​\(s\)\|\|V^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\\leq\|V^\{\*\}\(s\)\-V\_\{H\}^\{\*\}\(s\)\|\+\|V\_\{H\}^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\(79\)
whereVH∗​\(s\)V\_\{H\}^\{\*\}\(s\)is the optimal value function for theHH\-horizon problem\.

Bound 1: Finite Horizon ErrorFor the first term, by standard MDP theory:

\|V∗​\(s\)−VH∗​\(s\)\|≤γH​Rmax1−γ\|V^\{\*\}\(s\)\-V\_\{H\}^\{\*\}\(s\)\|\\leq\\frac\{\\gamma^\{H\}R\_\{\\max\}\}\{1\-\\gamma\}\(80\)
whereRmaxR\_\{\\max\}is the maximum reward\. SettingH≥log⁡\(ϵ​\(1−γ\)/Rmax\)log⁡γH\\geq\\frac\{\\log\(\\epsilon\(1\-\\gamma\)/R\_\{\\max\}\)\}\{\\log\\gamma\}makes this term≤ϵ/2\\leq\\epsilon/2\.

Bound 2: Model ErrorFor the second term, we use the fact that our learned model has errorδ=O​\(1/n\)\\delta=O\(1/\\sqrt\{n\}\)\. By the simulation lemma:

\|VH∗​\(s\)−V^H​\(s\)\|≤H​δ\(1−γ\)2\|V\_\{H\}^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\\leq\\frac\{H\\delta\}\{\(1\-\\gamma\)^\{2\}\}\(81\)
Settingδ≤ϵ​\(1−γ\)22​H\\delta\\leq\\frac\{\\epsilon\(1\-\\gamma\)^\{2\}\}\{2H\}requires:

n≥4​H2ϵ2​\(1−γ\)4​log⁡\(4​H2ϵ2​\(1−γ\)4\)n\\geq\\frac\{4H^\{2\}\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\log\\left\(\\frac\{4H^\{2\}\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\right\)\(82\)
Combining both bounds, we achieveϵ\\epsilon\-optimality with sample complexity:

O​\(H2​log⁡\(1/ϵ\)ϵ2​\(1−γ\)4\)O\\left\(\\frac\{H^\{2\}\\log\(1/\\epsilon\)\}\{\\epsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\right\)\(83\)
For the specific case mentioned in the theorem where the model is sufficiently accurate, the sample complexity reduces toO​\(H2​log⁡\(1/ϵ\)\)O\(H^\{2\}\\log\(1/\\epsilon\)\)\. ∎

### A\.19Proof of Theorem 3: Multi\-Scale Convergence

###### Proof\.

We show that multi\-scale editing enables faster convergence compared to single\-scale approaches\.

Let𝒜single\\mathcal\{A\}\_\{\\text\{single\}\}denote a single\-scale action space and𝒜multi=⋃i=1L𝒜i\\mathcal\{A\}\_\{\\text\{multi\}\}=\\bigcup\_\{i=1\}^\{L\}\\mathcal\{A\}\_\{i\}denote our multi\-scale action space withLLlevels\.

The key insight is that multi\-scale actions can make larger improvements in fewer steps\. Letdid\_\{i\}be the diameter of action space𝒜i\\mathcal\{A\}\_\{i\}\(maximum distance between any two states reachable via actions in𝒜i\\mathcal\{A\}\_\{i\}\)\.

Single\-Scale AnalysisFor single\-scale editing with action space𝒜single\\mathcal\{A\}\_\{\\text\{single\}\}, the convergence rate is:

Regretsingle​\(T\)=O​\(T⋅dsingle⋅log⁡\|𝒜single\|\)\\text\{Regret\}\_\{\\text\{single\}\}\(T\)=O\(\\sqrt\{T\\cdot d\_\{\\text\{single\}\}\\cdot\\log\|\\mathcal\{A\}\_\{\\text\{single\}\}\|\}\)\(84\)
Multi\-Scale AnalysisFor multi\-scale editing, we can use a hierarchical approach:

1. 1\.Use coarse actions \(largedid\_\{i\}\) for initial exploration
2. 2\.Switch to fine actions \(smalldid\_\{i\}\) for refinement

The convergence rate becomes:

Regretmulti​\(T\)=O​\(T⋅mini⁡di⋅log⁡\|𝒜multi\|\+L⋅log⁡T\)\\text\{Regret\}\_\{\\text\{multi\}\}\(T\)=O\\left\(\\sqrt\{T\\cdot\\min\_\{i\}d\_\{i\}\\cdot\\log\|\\mathcal\{A\}\_\{\\text\{multi\}\}\|\}\+L\\cdot\\log T\\right\)\(85\)
Improvement FactorThe improvement factor is:

Regretsingle​\(T\)Regretmulti​\(T\)=dsingle⋅log⁡\|𝒜single\|mini⁡di⋅log⁡\|𝒜multi\|\+L⋅log⁡T/T\\frac\{\\text\{Regret\}\_\{\\text\{single\}\}\(T\)\}\{\\text\{Regret\}\_\{\\text\{multi\}\}\(T\)\}=\\frac\{\\sqrt\{d\_\{\\text\{single\}\}\\cdot\\log\|\\mathcal\{A\}\_\{\\text\{single\}\}\|\}\}\{\\sqrt\{\\min\_\{i\}d\_\{i\}\\cdot\\log\|\\mathcal\{A\}\_\{\\text\{multi\}\}\|\}\+L\\cdot\\log T/\\sqrt\{T\}\}\(86\)
Sincemini⁡di≪dsingle\\min\_\{i\}d\_\{i\}\\ll d\_\{\\text\{single\}\}andLLis typically small \(e\.g\.,L=3L=3for token/step/structure levels\), we get a significant improvement factor proportional todsingle/mini⁡di\\sqrt\{d\_\{\\text\{single\}\}/\\min\_\{i\}d\_\{i\}\}\.

For our specific multi\-scale design wheredstructure≫dstep≫dtokend\_\{\\text\{structure\}\}\\gg d\_\{\\text\{step\}\}\\gg d\_\{\\text\{token\}\}, the improvement factor is approximatelyL\\sqrt\{L\}, demonstrating the benefit of hierarchical reasoning chain optimization\. ∎

### A\.20Statistical Analysis of Convergence

#### A\.20\.1Rate of Convergence Analysis

###### Theorem 7\(Detailed Convergence Rate\)\.

Under the assumptions stated above, the convergence rate of the latent world model is:

𝔼​\[‖T^θ−𝒫‖22\]≤C​\(d​log⁡nn\)1/2\+ϵapprox\\mathbb\{E\}\[\\\|\\hat\{T\}\_\{\\theta\}\-\\mathcal\{P\}\\\|\_\{2\}^\{2\}\]\\leq C\\left\(\\frac\{d\\log n\}\{n\}\\right\)^\{1/2\}\+\\epsilon\_\{\\text\{approx\}\}\(87\)whereϵapprox\\epsilon\_\{\\text\{approx\}\}is the approximation error of the function class\.

###### Proof\.

The proof follows from standard empirical process theory\. Letℱ\\mathcal\{F\}be our function class with VC dimensiondd\. By the symmetrization inequality and Rademacher complexity bounds:

𝔼​\[supf∈ℱ\|R^n​\(f\)−R​\(f\)\|\]≤2​𝔼​\[ℛn​\(ℱ\)\]\\mathbb\{E\}\[\\sup\_\{f\\in\\mathcal\{F\}\}\|\\hat\{R\}\_\{n\}\(f\)\-R\(f\)\|\]\\leq 2\\mathbb\{E\}\[\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\)\]\(88\)
whereℛn​\(ℱ\)\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\)is the Rademacher complexity\. For function classes with VC dimensiondd, we have:

ℛn​\(ℱ\)≤C​d​log⁡nn\\mathcal\{R\}\_\{n\}\(\\mathcal\{F\}\)\\leq C\\sqrt\{\\frac\{d\\log n\}\{n\}\}\(89\)
Combining with the approximation error gives the desired result\. ∎

#### A\.20\.2Confidence Intervals for Planning

###### Lemma 3\(Planning Confidence Intervals\)\.

With probability at least1−δ1\-\\delta, the planning error is bounded by:

\|V^H​\(s\)−V∗​\(s\)\|≤C​H​log⁡\(1/δ\)n\+γH​Rmax1−γ\|\\hat\{V\}\_\{H\}\(s\)\-V^\{\*\}\(s\)\|\\leq C\\sqrt\{\\frac\{H\\log\(1/\\delta\)\}\{n\}\}\+\\frac\{\\gamma^\{H\}R\_\{\\max\}\}\{1\-\\gamma\}\(90\)

###### Proof\.

This follows from combining the model error bound with the finite horizon error, using concentration inequalities for the empirical estimates\. ∎

### A\.21Robustness Analysis

#### A\.21\.1Sensitivity to Model Errors

###### Proposition 2\(Robustness to Model Perturbations\)\.

If the learned model has error‖T^θ−𝒫‖∞≤ϵ\\\|\\hat\{T\}\_\{\\theta\}\-\\mathcal\{P\}\\\|\_\{\\infty\}\\leq\\epsilon, then the planning performance degrades by at most:

\|V∗​\(s\)−V^H​\(s\)\|≤H​ϵ\(1−γ\)2\|V^\{\*\}\(s\)\-\\hat\{V\}\_\{H\}\(s\)\|\\leq\\frac\{H\\epsilon\}\{\(1\-\\gamma\)^\{2\}\}\(91\)

###### Proof\.

This follows from the simulation lemma and the fact that errors accumulate linearly with the planning horizonHH\. ∎

### A\.22Comparison with Existing Methods

#### A\.22\.1Sample Complexity Comparison

Table 6:Sample complexity comparison with existing CoT optimization methodsMethodSample ComplexityConvergence RatePlanning HorizonRandom SearchO​\(\|𝒜\|T\)O\(\|\\mathcal\{A\}\|^\{T\}\)ExponentialTTGradient\-basedO​\(ϵ−2\)O\(\\epsilon^\{\-2\}\)O​\(1/n\)O\(1/\\sqrt\{n\}\)1Thoughts\-as\-PlanningO​\(H2​log⁡\(1/ϵ\)\)O\(H^\{2\}\\log\(1/\\epsilon\)\)O​\(1/n\)O\(1/\\sqrt\{n\}\)HH

## Appendix BDetailed Experiments

### B\.1Edit\-type distribution and transfer scenarios

Table 7:Distribution of edit types and their average reward gains with statistical significance\.Edit TypeFrequency \(%\)Avg Reward GainSignificanceReorder examples28\.2±2\.1\+2\.3±0\.4p<0\.01p<0\.01Replace instruction22\.7±1\.8\+1\.9±0\.3p<0\.01p<0\.01Token deletion17\.5±1\.5\+1\.1±0\.2p<0\.05p<0\.05Span rewriting15\.4±1\.3\+1\.6±0\.3p<0\.01p<0\.01Add clarifier token11\.8±1\.2\+0\.9±0\.2p<0\.05p<0\.05Table 8:Transfer learning results across different scenarios\.Transfer ScenarioTarget TaskPerformanceQueriesZero\-shot \(SuperNI → AlpacaEval\)AlpacaEval73\.4±0\.845Few\-shot \(5 examples\)AlpacaEval75\.2±0\.742Cross\-domain \(XSum → Reddit\)Reddit TL;DR21\.8±0\.348Cross\-task \(SuperNI → PIQA\)PIQA79\.8±0\.652Fine\-tuned \(SuperNI → AlpacaEval\)AlpacaEval76\.1±0\.638
### B\.2Comprehensive Ablation Study

We conduct extensive ablation studies to understand the contribution of each component across multiple configurations:

Table 9:Comprehensive ablation study across multiple configurations, datasets, and model scales\. Results show mean ± std over 5 runs\.Δ\\Deltaindicates relative improvement over baseline\. Best configurations arebolded\.ConfigurationParams\(M\)Performance MetricsEfficiency MetricsSuperNIXSumHumanEvalLatency\(ms\)Memory\(GB\)Throughput\(prompts/hr\)Acc \(%\)Δ\\DeltaR\-L \(%\)Δ\\DeltaPass@1 \(%\)Δ\\DeltaComponent Ablation \(Base Model\)Baseline \(RandomEdit\)15\.278\.3±0\.978\.3\\pm 0\.9\-29\.3±0\.429\.3\\pm 0\.4\-18\.2±1\.118\.2\\pm 1\.1\-850±28850\\pm 286\.8±0\.26\.8\\pm 0\.21\.431\.43Individual Components\+ Latent Model Only18\.979\.2±0\.679\.2\\pm 0\.6\+0\.930\.8±0\.330\.8\\pm 0\.3\+1\.520\.1±0\.820\.1\\pm 0\.8\+1\.9450±18450\\pm 185\.2±0\.25\.2\\pm 0\.22\.222\.22\+ Reward Predictor Only20\.379\.5±0\.779\.5\\pm 0\.7\+1\.231\.0±0\.431\.0\\pm 0\.4\+1\.720\.8±0\.920\.8\\pm 0\.9\+2\.6520±20520\\pm 206\.1±0\.26\.1\\pm 0\.21\.921\.92\+ Multi\-scale Edits Only21\.880\.8±0\.580\.8\\pm 0\.5\+2\.531\.5±0\.331\.5\\pm 0\.3\+2\.222\.1±0\.722\.1\\pm 0\.7\+3\.9580±22580\\pm 226\.8±0\.26\.8\\pm 0\.21\.721\.72\+ Planning Horizon Only22\.181\.2±0\.681\.2\\pm 0\.6\+2\.932\.4±0\.332\.4\\pm 0\.3\+3\.123\.5±0\.823\.5\\pm 0\.8\+5\.3650±25650\\pm 257\.2±0\.27\.2\\pm 0\.21\.541\.54Pairwise Component CombinationsLatent \+ Reward19\.680\.1±0\.580\.1\\pm 0\.5\+1\.831\.3±0\.431\.3\\pm 0\.4\+2\.021\.2±0\.721\.2\\pm 0\.7\+3\.0420±16420\\pm 165\.8±0\.25\.8\\pm 0\.22\.382\.38Latent \+ Multi\-scale20\.381\.4±0\.681\.4\\pm 0\.6\+3\.132\.2±0\.332\.2\\pm 0\.3\+2\.923\.8±0\.823\.8\\pm 0\.8\+5\.6480±18480\\pm 186\.2±0\.26\.2\\pm 0\.22\.082\.08Latent \+ Planning21\.881\.8±0\.481\.8\\pm 0\.4\+3\.532\.8±0\.332\.8\\pm 0\.3\+3\.524\.5±0\.724\.5\\pm 0\.7\+6\.3540±20540\\pm 207\.0±0\.27\.0\\pm 0\.21\.851\.85Reward \+ Multi\-scale22\.181\.2±0\.581\.2\\pm 0\.5\+2\.932\.1±0\.432\.1\\pm 0\.4\+2\.823\.2±0\.823\.2\\pm 0\.8\+5\.0520±19520\\pm 196\.8±0\.26\.8\\pm 0\.21\.921\.92Reward \+ Planning22\.581\.6±0\.481\.6\\pm 0\.4\+3\.332\.5±0\.332\.5\\pm 0\.3\+3\.224\.1±0\.724\.1\\pm 0\.7\+5\.9560±21560\\pm 217\.1±0\.27\.1\\pm 0\.21\.791\.79Multi\-scale \+ Planning23\.282\.1±0\.582\.1\\pm 0\.5\+3\.833\.2±0\.333\.2\\pm 0\.3\+3\.925\.8±0\.825\.8\\pm 0\.8\+7\.6600±22600\\pm 227\.5±0\.27\.5\\pm 0\.21\.671\.67Three\-Component Combinationsw/o Planning Horizon21\.581\.3±0\.581\.3\\pm 0\.5\+3\.032\.3±0\.432\.3\\pm 0\.4\+3\.023\.9±0\.723\.9\\pm 0\.7\+5\.7520±19520\\pm 196\.8±0\.26\.8\\pm 0\.21\.921\.92w/o Multi\-scale Edits22\.181\.4±0\.481\.4\\pm 0\.4\+3\.132\.4±0\.332\.4\\pm 0\.3\+3\.124\.0±0\.624\.0\\pm 0\.6\+5\.8540±20540\\pm 207\.0±0\.27\.0\\pm 0\.21\.851\.85w/o Reward Predictor20\.380\.8±0\.580\.8\\pm 0\.5\+2\.531\.8±0\.431\.8\\pm 0\.4\+2\.522\.5±0\.722\.5\\pm 0\.7\+4\.3480±18480\\pm 186\.2±0\.26\.2\\pm 0\.22\.082\.08w/o Latent Model22\.880\.2±0\.480\.2\\pm 0\.4\+1\.931\.5±0\.331\.5\\pm 0\.3\+2\.221\.8±0\.621\.8\\pm 0\.6\+3\.6560±21560\\pm 217\.2±0\.27\.2\\pm 0\.21\.791\.79Full ConfigurationFull Prompt\-as\-Planning23\.883\.6±0\.583\.6\\pm 0\.5\+5\.333\.7±0\.233\.7\\pm 0\.2\+4\.428\.7±1\.428\.7\\pm 1\.4\+10\.5420±15420\\pm 158\.2±0\.28\.2\\pm 0\.22\.632\.63Hyperparameter Ablation \(Planning HorizonHH\)H=1H=123\.881\.8±0\.581\.8\\pm 0\.5\+3\.532\.5±0\.332\.5\\pm 0\.3\+3\.225\.2±1\.125\.2\\pm 1\.1\+7\.0380±14380\\pm 147\.8±0\.27\.8\\pm 0\.22\.942\.94H=2H=223\.882\.4±0\.482\.4\\pm 0\.4\+4\.133\.0±0\.333\.0\\pm 0\.3\+3\.726\.8±1\.226\.8\\pm 1\.2\+8\.6400±15400\\pm 158\.0±0\.28\.0\\pm 0\.22\.502\.50H=3H=323\.883\.6±0\.583\.6\\pm 0\.5\+5\.333\.7±0\.233\.7\\pm 0\.2\+4\.428\.7±1\.428\.7\\pm 1\.4\+10\.5420±15420\\pm 158\.2±0\.28\.2\\pm 0\.22\.632\.63H=4H=423\.883\.2±0\.483\.2\\pm 0\.4\+4\.933\.4±0\.333\.4\\pm 0\.3\+4\.128\.1±1\.328\.1\\pm 1\.3\+9\.9440±16440\\pm 168\.4±0\.28\.4\\pm 0\.22\.382\.38H=5H=523\.882\.8±0\.582\.8\\pm 0\.5\+4\.533\.1±0\.333\.1\\pm 0\.3\+3\.827\.5±1\.227\.5\\pm 1\.2\+9\.3460±17460\\pm 178\.6±0\.28\.6\\pm 0\.22\.172\.17Architecture Ablation \(Latent Dimensiondd\)d=256d=25618\.582\.1±0\.582\.1\\pm 0\.5\+3\.832\.8±0\.332\.8\\pm 0\.3\+3\.526\.5±1\.226\.5\\pm 1\.2\+8\.3380±14380\\pm 147\.2±0\.27\.2\\pm 0\.22\.782\.78d=512d=51223\.883\.6±0\.583\.6\\pm 0\.5\+5\.333\.7±0\.233\.7\\pm 0\.2\+4\.428\.7±1\.428\.7\\pm 1\.4\+10\.5420±15420\\pm 158\.2±0\.28\.2\\pm 0\.22\.632\.63d=768d=76828\.983\.8±0\.483\.8\\pm 0\.4\+5\.533\.9±0\.233\.9\\pm 0\.2\+4\.629\.1±1\.329\.1\\pm 1\.3\+10\.9480±18480\\pm 189\.8±0\.29\.8\\pm 0\.22\.082\.08d=1024d=102435\.284\.0±0\.384\.0\\pm 0\.3\+5\.734\.1±0\.234\.1\\pm 0\.2\+4\.829\.5±1\.229\.5\\pm 1\.2\+11\.3520±20520\\pm 2011\.5±0\.311\.5\\pm 0\.31\.921\.92Attention Head Ablation4 heads20\.782\.8±0\.582\.8\\pm 0\.5\+4\.533\.2±0\.333\.2\\pm 0\.3\+3\.927\.8±1\.227\.8\\pm 1\.2\+9\.6380±14380\\pm 147\.8±0\.27\.8\\pm 0\.22\.782\.788 heads23\.883\.6±0\.583\.6\\pm 0\.5\+5\.333\.7±0\.233\.7\\pm 0\.2\+4\.428\.7±1\.428\.7\\pm 1\.4\+10\.5420±15420\\pm 158\.2±0\.28\.2\\pm 0\.22\.632\.6316 heads28\.983\.9±0\.483\.9\\pm 0\.4\+5\.633\.8±0\.233\.8\\pm 0\.2\+4\.528\.9±1\.328\.9\\pm 1\.3\+10\.7460±17460\\pm 178\.8±0\.28\.8\\pm 0\.22\.382\.3832 heads38\.284\.1±0\.384\.1\\pm 0\.3\+5\.834\.0±0\.234\.0\\pm 0\.2\+4\.729\.2±1\.229\.2\\pm 1\.2\+11\.0520±20520\\pm 209\.5±0\.29\.5\\pm 0\.22\.082\.08Cross\-Scale Consistency \(Different Model Sizes\)Prompt\-as\-Planning \(Base\)23\.883\.6±0\.583\.6\\pm 0\.5\+5\.333\.7±0\.233\.7\\pm 0\.2\+4\.428\.7±1\.428\.7\\pm 1\.4\+10\.5420±15420\\pm 158\.2±0\.28\.2\\pm 0\.22\.632\.63Prompt\-as\-Planning \(Large\)45\.284\.2±0\.484\.2\\pm 0\.4\+5\.934\.3±0\.234\.3\\pm 0\.2\+5\.029\.8±1\.329\.8\\pm 1\.3\+11\.6480±18480\\pm 189\.8±0\.29\.8\\pm 0\.22\.082\.08Prompt\-as\-Planning \(XL\)78\.984\.8±0\.384\.8\\pm 0\.3\+6\.534\.9±0\.234\.9\\pm 0\.2\+5\.630\.5±1\.230\.5\\pm 1\.2\+12\.3520±20520\\pm 2011\.5±0\.311\.5\\pm 0\.31\.921\.92Prompt\-as\-Planning \(XXL\)125\.385\.2±0\.385\.2\\pm 0\.3\+6\.935\.3±0\.235\.3\\pm 0\.2\+6\.031\.1±1\.131\.1\\pm 1\.1\+12\.9580±22580\\pm 2213\.8±0\.313\.8\\pm 0\.31\.721\.72##### Key Ablation Insights\.

\(1\)Component Synergy: The combination of all components provides significantly better performance than individual components, with synergistic effects of 1\.5\-2\.1% improvement\. \(2\)Planning Horizon: Optimal performance is achieved withH=3H=3, balancing exploration and exploitation\. \(3\)Architecture Scaling: Performance improves with model size but with diminishing returns beyond XL scale\. \(4\)Parameter Efficiency: The base configuration achieves the best performance\-to\-parameter ratio, making it suitable for deployment\.

### B\.3Advanced Ablation Analysis

##### Component Interaction Analysis\.

We analyze how different components interact and their synergistic effects:

Table 10:Component interaction analysis showing synergistic effects\.Component PairIndividual GainCombined GainSynergyLatent Model \+ Reward\+1\.8%\+3\.1%\+1\.3%Latent Model \+ Multi\-scale\+2\.5%\+4\.2%\+1\.7%Reward \+ Planning\+2\.9%\+4\.8%\+1\.9%Multi\-scale \+ Planning\+3\.8%\+5\.3%\+1\.5%All Components–\+5\.3%\+2\.1%

## Appendix CAnalysis and Discussion

##### Trajectory Smoothness and Interpretability\.

We visualize latent planning trajectories across multiple reasoning tasks\. As shown in Figure[3](https://arxiv.org/html/2605.28842#S3.F3), our method generates smooth, monotonic edits in a proximity\-preserving space\. Most successful trajectories involve initial large\-scale restructuring \(e\.g\., reasoning step rewriting or logical flow reordering\), followed by fine\-grained token tuning\. These patterns mirror how humans revise reasoning chains\.

##### Edit Pattern Reusability\.

We test whether edit sequences learned on one reasoning chain can be reused on others\. We find that high\-reward edit patterns \(e\.g\., intermediate conclusion insertion \+ logical connector update\) transfer across reasoning tasks with only minor adaptation, indicating that the planner captures structural reasoning chain heuristics\.

##### Latent Reward Generalization\.

The learned utility predictorR^ψ\\hat\{R\}\_\{\\psi\}maintains ranking consistency across unseen reasoning tasks\. Kendall\-τ\\taucorrelation between predicted and actual reasoning task rewards on zero\-shot reasoning chains remains above 0\.68, validating the generality of the learned latent reward landscape\.

##### Failure Modes\.

We identify two major failure cases: \(1\) reasoning chains that require global semantic transformation \(e\.g\., convert direct answer to step\-by\-step reasoning\), and \(2\) reasoning tasks with ambiguous reward signals\. In these cases, the planner may converge to suboptimal edits due to lack of latent dynamics supervision\.

##### Planning Horizon Sensitivity\.

We vary the rollout horizonHHfrom 1 to 5\. Performance peaks atH=3H=3for most tasks\. Short horizons lead to greedy edits; long horizons introduce compounding prediction noise\. This suggests latent planning works best when edits are semi\-local\.

## Appendix DSystem Efficiency Analysis

Table 11:Detailed efficiency analysis of Thoughts\-as\-Planning components\.ComponentTraining TimeParametersInference LatencyLatent World Model4\.2 hours12\.5M2\.3msReward Predictor1\.8 hours8\.2M1\.1msEdit Planner0\.5 hours3\.1M0\.8msTotal System6\.5 hours23\.8M4\.2msTable 12:Overall cost and efficiency comparison\.Method\#QueriesTime \(s\)Cost \($\)PromptGen18092\.30\.72RLPrompt15080\.10\.60Thoughts\-as\-Planning4724\.50\.19##### Efficiency Insights\.

\(1\)Training overheadis amortized over multiple reasoning chain instances, making the method cost\-effective for production use\. \(2\)Inference latencyis minimal \(4\.2ms\), enabling real\-time reasoning chain optimization\. \(3\)Query reductionof 70% translates to significant cost savings in API\-based deployments\. \(4\)Memory footprintis reasonable \(23\.8M parameters\), suitable for edge deployment\.

## Appendix EImplementation Details

### E\.1Model Architecture Details

Table 13:Model architecture specificationsComponentArchitectureInput DimOutput DimLatent Encoderhϕh\_\{\\phi\}4\-layer Transformer\(x,p\)\(x,p\)ℝ512\\mathbb\{R\}^\{512\}Latent TransitionT^θ\\hat\{T\}\_\{\\theta\}MLP \+ Residual𝐳t∈ℝ512\\mathbf\{z\}\_\{t\}\\in\\mathbb\{R\}^\{512\}𝐳t\+1∈ℝ512\\mathbf\{z\}\_\{t\+1\}\\in\\mathbb\{R\}^\{512\}Reward EstimatorR^ψ\\hat\{R\}\_\{\\psi\}2\-layer MLP𝐳t∈ℝ512\\mathbf\{z\}\_\{t\}\\in\\mathbb\{R\}^\{512\}R^ψ​\(𝐳t\)∈ℝ\\hat\{R\}\_\{\\psi\}\(\\mathbf\{z\}\_\{t\}\)\\in\\mathbb\{R\}
### E\.2Edit Action Space

Table 14:Reasoning chain edit action types and parametersScaleOperationsParameter FormatToken\-leveladd, delete, replace\(token,type,target\)\(\\text\{token\},\\text\{type\},\\text\{target\}\)Span\-levelmove, paraphrase, insert example\(step,type,target\)\(\\text\{step\},\\text\{type\},\\text\{target\}\)Block\-levelreplace instruction, reorder examples\(structure,type,target\)\(\\text\{structure\},\\text\{type\},\\text\{target\}\)
### E\.3Computational Resources

ComponentTraining TimeMemory UsageInference TimeLatent Encoder2\.1 hours2\.3GB1\.2msTransition Model3\.8 hours3\.1GB0\.8msReward Predictor1\.2 hours1\.8GB0\.3msPlanning Module0\.5 hours0\.9GB2\.1msTotal System7\.6 hours8\.1GB4\.4msTable 15:Computational resource requirements for training and inference\.
### E\.4Statistical Significance Testing

We conduct statistical significance tests to validate our experimental results\. For each comparison, we use paired t\-tests with Bonferroni correction for multiple comparisons\.

ComparisonMethod 1Method 2p\-valueSuperNIThoughts\-as\-PlanningRLPromptp<0\.001p<0\.001XSumThoughts\-as\-PlanningPromptGenp<0\.001p<0\.001HumanEvalThoughts\-as\-PlanningAutoPromptp<0\.001p<0\.001Table 16:Statistical significance tests for main experimental comparisons\.
### E\.5Reproducibility

All experiments are conducted with fixed random seeds for reproducibility\. The complete experimental setup, including hyperparameters, model architectures, and training procedures, is documented in this appendix\. Code is attached in supplementary material\.

ParameterValueWorld model epochs50Planning horizonHH3Planning stepsTT8Reward loss weightλrew\\lambda\_\{\\text\{rew\}\}1\.0Latent dimensiondd512Learning rate1e\-4Batch size32Candidate actionsKK10Temperatureτ\\tau0\.1Table 17:Training hyperparameters\.

Similar Articles

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

arXiv cs.CL

Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

arXiv cs.CL

This research paper from MediaTek and National Taiwan University challenges the assumption that reasoning chains must be dense and sequential, showing that models can extract answers from sparse, shuffled, and noisy reasoning traces. The findings suggest that answer extraction is robust and order-independent, potentially enabling more efficient, parallelized reasoning generation.