LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

arXiv cs.CL Papers

Summary

The paper introduces LaTER, a two-stage reasoning paradigm that combines latent exploration with explicit Chain-of-Thought verification to reduce token usage and improve efficiency in large language models without sacrificing accuracy.


# LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
Source: [https://arxiv.org/html/2605.07315](https://arxiv.org/html/2605.07315)
Xuan Li¹, Yining Wang², Yuchen Liu, Guanjun Liu¹, Delai Qiu², Shengping Liu², Jiaen Liang², Wei Huang², Jun Yu¹*, Junnan Zhu³*

¹University of Science and Technology of China, ²Unisound AI Technology Co., Ltd, ³MAIS, Institute of Automation, Chinese Academy of Sciences

*Corresponding authors. harryjun@ustc.edu.cn, junnan.zhu@nlpr.ia.ac.cn

###### Abstract

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%–32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at [https://github.com/TioeAre/LaTER](https://github.com/TioeAre/LaTER).

## 1 Introduction

CoT prompting is a simple and effective way to improve reasoning in LLMs [20]. By generating intermediate derivations before the final answer, CoT improves performance on mathematics, science, and code tasks [7]. Its main drawback is cost. Strong reasoning models often produce long visible traces, and each additional token increases latency, memory traffic, and attention computation [19]. The cost is especially high when the model spends many tokens on tentative exploration, syntactic scaffolding, or discarded solution paths before reaching a stable derivation.

Recent work therefore studies reasoning in a continuous latent space [8, 24]. Instead of sampling a visible token at every reasoning step, a model can feed back a hidden state or a soft embedding as the next input, using either an analytic mapping such as a pseudo-inverse projection [28] or a learned projector [21], and only decode discrete readable tokens in the final answer stage. This can substantially reduce visible token generation and has shown promising efficiency gains [8, 23, 27]. However, pure latent reasoning also has a clear weakness: when a problem requires careful symbolic manipulation, explicit checking, or exact answer formatting, fully replacing CoT with latent computation can reduce accuracy on difficult benchmarks such as MATH-500 and AIME [5, 18, 16].

This suggests that latent and discrete reasoning should not be viewed as mutually exclusive alternatives. A more natural division of labor is to use continuous computation for early exploration and reserve discrete tokens for verification. Human solvers often behave in a similar way: they may first search mentally for a plan and only later write a step-by-step solution. We use this analogy only as motivation for a computational design. The central question is whether an LLM can spend part of its test-time computation in a high-bandwidth latent state, then return to explicit CoT when precise symbolic reasoning is most valuable.

We propose **LaTER** (Latent-Then-Explicit Reasoning), a hybrid reasoning paradigm that separates exploration from verification. Given a prompt, LaTER first performs a bounded latent rollout. At each latent step, the final-layer hidden state is mapped back into the input embedding space and reused as the next input, without committing to a visible token. The model then switches to ordinary token generation while preserving the latent-phase KV cache, so the explicit derivation is conditioned on the preceding latent trajectory rather than starting from scratch.

We study LaTER in two settings. First, we show that no additional training is required for the interface to be useful. A training-free version uses a simple adaptive switch based on latent entropy and decoded stop-token probes. On Qwen3-14B, this already improves AIME 2025 from 70.0% to 73.3% while reducing average token usage from 15,730 to 10,661, and improves MATH-500 from 93.4% to 97.2% with 17% fewer tokens. Second, we train a LaTER model on **Latent-Switch-69K**, a dataset designed to teach the model how to allocate latent exploration before explicit reasoning. The trained model reaches 80.0% on AIME 2025, a 10.0-point gain over the standard CoT baseline, while using 33% fewer tokens.

Our contributions are threefold. (i) We introduce a latent-then-explicit reasoning interface that preserves the latent KV cache and turns latent computation into a precursor to explicit verification. (ii) We identify training-free latent switching signals, including terminating-token probes and entropy dynamics, showing that pretrained reasoning models can already support structured latent rollouts. (iii) We construct **Latent-Switch-69K** and train a LaTER model that improves the accuracy–efficiency tradeoff across mathematics, coding, and knowledge-intensive reasoning benchmarks.

## 2 Training-Free LaTER

We first ask whether a pretrained reasoning model can benefit from a latent-first, explicit-second procedure without any task-specific training. This setting isolates the inference-time interface from supervised adaptation. We show that strong reasoning models can perform several continuous latent steps, retain those steps in the KV cache, and then convert the accumulated state into explicit CoT with lower token usage. We also show that fixed latent horizons are brittle, motivating adaptive switching based on the model's own latent dynamics.

### 2.1 Preliminaries and notation

![Refer to caption](https://arxiv.org/html/2605.07315v1/x1.png)

Figure 1: Overview of training-free LaTER. Given a user prompt, the model first enters a latent reasoning phase, where the final-layer hidden state is projected back into the input embedding space and reused as the next-step input, without committing to visible tokens. The model then switches to explicit CoT decoding, reusing the latent KV cache to generate reasoning steps and the final answer.

Let $Q=(Q_1,\ldots,Q_m)$ denote the prompt. At latent step $s$, the model produces a final-layer hidden state $h_s \in \mathbb{R}^{d_h}$. Instead of decoding $h_s$ to a token ID and feeding that token back into the model, we map $h_s$ directly into the input embedding space. Following the latent-transition construction of LatentMAS [28], we use

$$e_{s+1}^{\mathrm{lat}} = W_a h_s, \qquad W_a \approx W_{out}^{\dagger} W_{in}, \tag{1}$$

where $W_{in}$ is the input embedding matrix, $W_{out}$ is the output projection matrix, and $W_{out}^{\dagger}$ denotes the pseudo-inverse of $W_{out}$. The vector $e_{s+1}^{\mathrm{lat}}$ is then used as the next-step input embedding. This produces a continuous trajectory

$$h_1 \rightarrow e_2^{\mathrm{lat}} \rightarrow h_2 \rightarrow e_3^{\mathrm{lat}} \rightarrow \cdots \rightarrow h_S, \tag{2}$$

with no discrete token commitment at the intermediate latent positions. For diagnostics only, we decode each latent hidden state into a probe distribution and an argmax probe token,

$$p_s = \mathrm{softmax}(W_{out} h_s), \qquad \hat{y}_s = \arg\max_i p_s(i). \tag{3}$$

The probe token $\hat{y}_s$ is never used as the next input. It is only an observation of how the latent state aligns with the model's vocabulary space. We also compute the entropy of the probe distribution,

$$\mathcal{H}_s = -\sum_i p_s(i) \log p_s(i), \tag{4}$$

which provides a scalar summary of the model's uncertainty at that latent step.
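To make the interface concrete, the following is a minimal PyTorch sketch of the latent transition (Eq. (1)) and the diagnostic probes (Eqs. (3)–(4)). It assumes a HuggingFace-style causal LM; the attribute names `get_input_embeddings` and `lm_head`, and the helper name `latent_step_diagnostics`, are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

# Assumed handles on a HuggingFace-style causal LM; exact attribute names
# vary by model family and are not taken from the paper's code.
W_in = model.get_input_embeddings().weight.float()   # (V, d_h) input embeddings
W_out = model.lm_head.weight.float()                 # (V, d_h) output projection

# Eq. (1): analytic latent transition W_a ~= W_out^dagger W_in, computed once.
W_a = torch.linalg.pinv(W_out) @ W_in                # (d_h, d_h)

def latent_step_diagnostics(h_s: torch.Tensor):
    """Map a final-layer hidden state h_s (d_h,) to the next latent input
    embedding (Eq. 1) and compute the probe distribution, argmax probe
    token, and entropy (Eqs. 3-4). The probe token is never fed back."""
    e_next = W_a @ h_s                                   # e_{s+1}^{lat}
    p_s = F.softmax(W_out @ h_s, dim=-1)                 # probe distribution, Eq. (3)
    probe_token = int(p_s.argmax())                      # hat{y}_s, diagnostics only
    entropy = float(-(p_s * (p_s + 1e-12).log()).sum())  # H_s, Eq. (4)
    return e_next, probe_token, entropy
```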

After the latent rollout, LaTER switches to ordinary explicit CoT decoding. The switch is not a reset: we pass the latent-phase `past_key_values` into the explicit phase, so the generated derivation conditions on the latent trajectory. We evaluate two switching policies:

- **Fixed-step switching.** The model performs $N$ latent steps and then enters explicit CoT decoding.
- **Adaptive switching.** The model exits latent reasoning when either the entropy crosses a threshold or the decoded probe token belongs to a model-specific set of terminating tokens such as `<|im_end|>`, `</think>`, or `<|endoftext|>`.

Formally, the adaptive switch is

$$\mathrm{switch}(s) = \mathbf{1}\!\left[\mathcal{H}_s > \tau_{\mathcal{H}} \;\lor\; \hat{y}_s \in \mathcal{T}_{\mathrm{stop}}\right], \tag{5}$$

where $\tau_{\mathcal{H}}$ is an entropy threshold and $\mathcal{T}_{\mathrm{stop}}$ is the terminating-token set. The next subsection explains why these two signals are empirically meaningful.
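A sketch of the adaptive rollout loop implementing Eq. (5), reusing `latent_step_diagnostics` from above and assuming a HuggingFace-style `tokenizer`. The helper `forward_one_step` stands in for a single cached forward pass that returns the final-layer hidden state and an updated KV cache; its name and signature, and the step cap, are our assumptions.

```python
# Entropy threshold and terminating-token set (Sec. 2.5 uses tau_H = 7).
TAU_H = 7.0
STOP_IDS = {tokenizer.convert_tokens_to_ids(t)
            for t in ["<|im_end|>", "</think>", "<|endoftext|>"]}
MAX_LATENT_STEPS = 128          # safety bound; not specified in the paper

e_t = prompt_last_embedding     # embedding entering the latent phase
past_kv = prompt_past_kv        # KV cache accumulated over the prompt
for s in range(MAX_LATENT_STEPS):
    h_s, past_kv = forward_one_step(model, inputs_embeds=e_t, past_kv=past_kv)
    e_t, probe_token, entropy = latent_step_diagnostics(h_s)
    # Eq. (5): switch when entropy crosses tau_H or the probe decodes to a
    # model-native terminating token.
    if entropy > TAU_H or probe_token in STOP_IDS:
        break

# Explicit phase: ordinary token decoding continues from past_kv, so the
# derivation is conditioned on the latent trajectory rather than a reset.
```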

### 2.2 Empirical motivation: latent trajectories are structured

A concern with latent reasoning is that hidden states might drift away from the vocabulary manifold, making repeated latent transitions unstable or semantically meaningless. Our experiments suggest a more structured picture for reasoning models such as Qwen3-14B [22], DeepSeek-R1-Distill-Llama-8B [7], and OLMo3-32B-Think [13].

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/sentence_entropy_trend.png)

Figure 2: Entropy over normalized reasoning progress on AIME 2025 for Qwen3-14B. Blue: mean latent-reasoning entropy after aligning each example from latent start to end. Red: mean CoT entropy after normalizing each sentence by within-sentence progress.

**Phenomenon 1: probe tokens reveal autoregressive stopping structure.** Early latent states often decode to low-content probes, such as empty strings or repeated newline symbols ("\n\n"). After additional latent steps, however, the argmax probe frequently reaches model-native terminating symbols such as `<|im_end|>`, `</think>`, or `<|endoftext|>`. These probe tokens are not fed back into the model, so they do not drive the rollout. Their appearance instead indicates that the continuous trajectory remains coupled to the model's generative prior. In this sense, latent reasoning does not behave like arbitrary numerical drift; it often approaches states that the language model itself would interpret as closure.

This observation is central to LaTER. If the model internally approaches a state that resembles "ready to stop", then switching to explicit CoT can be aligned with the model's own trajectory and the reasoning patterns it acquired during pretraining, rather than imposed at an unrelated time.

**Phenomenon 2: entropy supports an explore-then-verify interpretation.** As shown in Figure 2, the average entropy during latent rollout tends to rise over normalized latent progress before termination. This differs from ordinary explicit decoding, where entropy is often locally high near the beginning of a sentence and then declines as syntax and previously generated words constrain the continuation. The latent phase therefore appears to support a broader and less locally constrained search, while the later explicit phase converts the accumulated state into a step-by-step derivation.

We do not claim that entropy alone fully explains latent reasoning. Rather, these two observations provide practical switching signals: the terminating-token probe suggests that the trajectory is approaching closure, and the entropy profile indicates when the latent state is entering a high-uncertainty regime. Together they motivate the adaptive rule in Eq. (5).

### 2.3 Training-free experimental setup

We compare standard discrete CoT decoding with training-free LaTER under the same prompts and decoding settings. We report accuracy and total token usage. For LaTER, token usage counts both latent steps and emitted explicit tokens, so reductions are not an artifact of ignoring latent computation. We evaluate Qwen3-14B, DeepSeek-R1-Distill-Llama-8B, and OLMo3-32B-Think on AIME 2025 [1], MATH-500 [9], GSM8K [3], GPQA [15], ARC-Challenge [2], HumanEval+, and MBPP+ [10, 11].

### 2.4 Fixed-step switching results

For Qwen3-14B, we follow the official decoding recommendations: temperature = 0.6, top-p = 0.95, top-k = 20, and max_new_tokens = 38192. Under this setup, the standard discrete CoT baseline reaches 70.0% accuracy on AIME 2025 with roughly 16K tokens on average. Figure 3 shows that fixed-step LaTER can reduce token usage substantially, but it does not fully match the baseline accuracy. The best fixed horizons, around 50–60 latent steps, reach 63.3% accuracy with about 10K–12K total tokens.

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/latent_reasoning_combo_chart.png)

Figure 3: Accuracy and token usage on AIME 2025 as the fixed latent-step budget varies.

The fixed-step curve is non-monotonic: performance first improves as the latent budget increases, then degrades when the return to explicit reasoning is delayed too long. This pattern supports the role separation behind LaTER. Latent exploration is useful up to a point, but difficult problems still benefit from an explicit symbolic phase that checks intermediate conclusions and formats the final answer. A single fixed horizon cannot adapt to instance difficulty, which motivates the adaptive switch.

### 2.5 Adaptive switching results

Adaptive LaTER uses the same decoding configuration as above but replaces the fixed latent horizon with Eq. (5). At each latent step, we monitor the entropy of the probe distribution and the argmax probe token. The model switches to explicit CoT once the entropy exceeds 7 or the probe token becomes a terminating symbol such as `<|im_end|>`. Table 1 shows that this simple policy usually reduces token usage and often preserves or improves accuracy relative to the paired CoT baseline. The gains are particularly evident in stronger models and on tasks requiring extended reasoning, where adaptive switching achieves greater token savings without compromising solution quality. The effect is strongest for Qwen3-14B: adaptive LaTER improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661, and it improves MATH-500 from 93.4% to 97.2% while reducing tokens by 17%.

Figure 5 gives a qualitative view of one AIME 2025 example. In the dominant PC1–PC2 plane, the discrete CoT trajectory is relatively scattered, suggesting that the model is still searching for a solution path. LaTER first follows a compact latent trajectory. After switching, its explicit trajectory forms repeated refinements along a shared direction rather than spreading randomly. This visualization is not a proof of mechanism, but it is consistent with the hypothesis that latent exploration organizes the state before explicit verification.

Table 1: Training-free LaTER with adaptive switching across seven benchmarks, compared with the corresponding discrete CoT baseline for each backbone. Token counts include latent steps plus emitted explicit tokens. Green cells mark accuracy gains over the paired baseline, blue cells mark token reductions, and red cells mark token increases.

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/entropy_threshold_combo_chart.png)

Figure 4: Effect of the entropy threshold $\tau_{\mathcal{H}}$ on training-free adaptive LaTER for Qwen3-14B on AIME 2025.

**Why does adaptive switching help?** A fixed horizon gives every instance the same latent budget, regardless of difficulty or internal confidence. Figures 3 and 4 together show that neither too few nor too many latent steps are desirable. If the latent phase is too short, the model leaves latent exploration before the hidden state is sufficiently organized, so the later explicit CoT cannot fully benefit. If it is too long, latent computation begins to replace useful explicit verification, which hurts accuracy and can also weaken the overall efficiency–accuracy tradeoff. The key is therefore to find a balanced exit point where latent exploration is sufficient but not excessive. Adaptive switching aims to approximate this balance by using the model's own latent dynamics, including probe entropy and terminating-token probes, to decide when to exit according to the difficulty of the current problem and the model's internal confidence.

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/pca_trajectory_figure.png)

Figure 5: PCA trajectories of Qwen3-14B on an AIME 2025 example under adaptive switching. The plotted coordinates are the first six PCA components of the final-layer hidden states. Blue denotes the latent trajectory, and red denotes the first 256 steps of the explicit CoT trajectory. Color intensity increases from light to dark as reasoning progresses.
### 2.6 Failure cases and the limit of hand-crafted switching

The training-free results also reveal an important limitation. On some AIME 2025 problems, latent entropy remains low and the discrete CoT baseline already solves the problem correctly; adding latent steps can still hurt. Low entropy therefore does not by itself imply that the model should remain in, or exit from, the latent phase. Appendix A analyzes sample-level entropy trajectories. A key finding is that the model does not use the same latent entropy scale for every problem. Instead, it appears to selectively amplify the entropy of the latent phase depending on problem difficulty and internal confidence. By contrast, in ordinary CoT decoding, the peak entropy and overall entropy range are much more similar across samples. In the latent phase, however, different samples can show more than an order-of-magnitude difference in entropy scale.

This failure mode has two implications. First, a single global entropy threshold is too coarse: different problems can require different latent horizons even when their entropy values appear similar. Second, the model should not only be monitored during latent reasoning; it should learn when to stop. The training-free setting establishes that useful switching signals exist, but it also shows that hand-crafted rules only approximate the ideal decision boundary. This motivates the supervised LaTER training procedure in the next section.

## 3 Training LaTER

The training-free study shows that pretrained models can use latent rollouts, but it also shows that hand-crafted switching is limited. We therefore train a Qwen3-14B model to use a latent segment before explicit reasoning. The trained system differs from the training-free version in two ways: it replaces the pseudo-inverse mapping with a learned projector, and it receives supervision that teaches the model how long the latent segment should be.

### 3.1 Model architecture

We extend the tokenizer with two boundary tokens, `<latent_think>` and `</latent_think>`. The embedding layer, transformer backbone, and language-modeling head keep their original architecture, and we add a lightweight projector $g_\phi$ that maps decoder hidden states back into the token embedding space. During latent reasoning, the model computes

$$h_t = f_\theta(e_t, \mathcal{C}_{<t}), \qquad e_{t+1} = g_\phi(h_t), \tag{6}$$

where $f_\theta$ is the transformer, $e_t$ is the current latent input embedding, and $\mathcal{C}_{<t}$ is the causal context. This recurrence updates the model's internal reasoning state without emitting a visible token at each latent position. The supervised assistant format is

$$\texttt{<latent\_think>}~l_1,\ldots,l_m~\texttt{</latent\_think>}~\texttt{<think>}~t_1,\ldots,t_n~\texttt{</think>}~a, \tag{7}$$

where $l_i$ are latent placeholder positions, $t_i$ are distilled explicit CoT tokens, and $a$ is the final answer. The latent placeholders are not supervised with ordinary token-level cross-entropy (CE). Instead, their input embeddings are replaced by recurrent projector outputs, so gradients from the later explicit reasoning and answer tokens teach the model how to use them as hidden computation steps.
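As a concrete reading of Eq. (6), the sketch below shows a learned projector and the recurrent latent rollout over the placeholder positions. The paper only describes $g_\phi$ as a lightweight projector, so the single linear layer and the helper `run_backbone_step` are our assumptions.

```python
import torch
import torch.nn as nn

class LatentProjector(nn.Module):
    """Assumed form of g_phi: a single linear map from decoder hidden
    states back to the token embedding space (Eq. 6)."""
    def __init__(self, d_h: int):
        super().__init__()
        self.proj = nn.Linear(d_h, d_h)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        return self.proj(h_t)               # e_{t+1} = g_phi(h_t)

def latent_rollout(model, g_phi, e_start, past_kv, m_steps):
    """Run m latent steps; each placeholder position consumes the projected
    previous hidden state instead of a token embedding. `run_backbone_step`
    is a hypothetical single-step forward returning (h_t, updated cache)."""
    e_t = e_start
    for _ in range(m_steps):
        h_t, past_kv = run_backbone_step(model, inputs_embeds=e_t,
                                         past_kv=past_kv)
        e_t = g_phi(h_t)                    # recurrence of Eq. (6)
    return e_t, past_kv
```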

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/token_and_compression_distribution_figure.png)

Figure 6: Token statistics of the distilled corpus. The figure compares original and distilled reasoning lengths and shows the resulting CoT compression ratios.
### 3\.2Training data construction

We construct the supervised corpus from reasoning traces sampled from Dolci\-Think\-SFT\-32B\[[13](https://arxiv.org/html/2605.07315#bib.bib13)\]and distill them with a stronger reasoning teacher\. For each problem, the teacher produces a short*solution intuition*: a few sentences describing the high\-level plan without a full derivation\. The teacher then generates a shorter explicit CoT conditioned on the original problem and the solution intuition\. Each retained record contains a problem, an intuition, a compressed CoT, and a final answer\.

The latent budget is tied to the intuition length\. If the retained intuition containsLLtokens, preprocessing assigns approximatelyL/2L/2latent steps, subject to maximum\-length and tokenization constraints\. This design uses the intuition length as a proxy for how much condensed reasoning should be represented by the latent segment\.

Table 2:Statistics of the distilled SFT corpus\. The compression ratio is the distilled CoT length divided by the original CoT length\.Each training record is rendered as a two\-part assistant response\. The latent segment contains<latent\_think\>, a repeated padding placeholder, and</latent\_think\>\. The explicit segment contains<think\>, the distilled CoT,</think\>, and the answer\. We also build a teacher\-reference conversation in which the problem is paired with the distilled solution intuition and the shortened explicit reasoning trace\. This reference provides teacher KL\-distribution supervision over explicit reasoning and answer tokens\. Figure[6](https://arxiv.org/html/2605.07315#S3.F6)summarizes the resulting token counts and the compression ratio between the original and distilled chains of thought, showing that the distilled traces are substantially shorter while still preserving useful reasoning content\. Table[2](https://arxiv.org/html/2605.07315#S3.T2)gives a compact summary of the dataset\. The final training split contains 69,745 examples, with most samples in the medium\-difficulty bucket \(65\.5%\), followed by hard \(25\.0%\) and easy \(9\.5%\)\. The compression ratio has a mean of 0\.612 and a median of 0\.569, which means that the distilled CoTs keep only about 57–61% of the original reasoning length\. The curriculum metadata groups samples by difficulty so that early training can emphasize easier examples\. As detailed in Appendix[B](https://arxiv.org/html/2605.07315#A2), the underlying source mix is math\- and code\-heavy: math accounts for about 37% of examples and code for about 34%, while science\-oriented questions contribute about 5% and the remainder mainly comes from instruction\-following and general knowledge\-oriented prompts\.

### 3.3 Optimization objective

![Refer to caption](https://arxiv.org/html/2605.07315v1/x2.png)

Figure 7: Overview of the training pipeline. We construct latent-reasoning training sequences with latent and explicit reasoning segments, train the model with supervised and teacher-matching objectives, and obtain a model that first performs latent reasoning and then switches to explicit CoT generation.

We train with a mixture of supervised language modeling, self-distillation [26], and latent halting supervision. Let $\mathcal{S}_{\mathrm{CoT}}$ denote the interior explicit reasoning positions between `<think>` and `</think>`, and let $\mathcal{S}_{\mathrm{nonCoT}}$ denote the remaining supervised response positions, including structural tags and answer tokens. For target token $y_i$, the cross-entropy loss is

$$\mathcal{L}_{\mathrm{CE}} = \frac{1}{|\mathcal{S}_{\mathrm{nonCoT}}|}\sum_{i\in\mathcal{S}_{\mathrm{nonCoT}}} -\log p_\theta(y_i \mid x_{<i}) + \lambda_{\mathrm{CoT}}\,\frac{1}{|\mathcal{S}_{\mathrm{CoT}}|}\sum_{i\in\mathcal{S}_{\mathrm{CoT}}} -\log p_\theta(y_i \mid x_{<i}). \tag{8}$$

Only the interior reasoning tokens belong to $\mathcal{S}_{\mathrm{CoT}}$. The boundary tags, latent boundary tokens, answer tokens, and `<|im_end|>` belong to $\mathcal{S}_{\mathrm{nonCoT}}$. We set $\lambda_{\mathrm{CoT}} = 0.5$ so that explicit reasoning supervision remains useful without overwhelming the structural and answer tokens.
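A minimal sketch of the two-bucket cross-entropy in Eq. (8). It assumes `logits` and `labels` are already shifted for next-token prediction and that boolean masks for $\mathcal{S}_{\mathrm{CoT}}$ and $\mathcal{S}_{\mathrm{nonCoT}}$ come from preprocessing; the function name is ours.

```python
import torch
import torch.nn.functional as F

def later_ce_loss(logits, labels, cot_mask, noncot_mask, lambda_cot=0.5):
    """Eq. (8): separately averaged CE over explicit-CoT positions and the
    remaining supervised positions (tags and answer tokens). Positions
    labeled -100 contribute zero loss via the default ignore_index."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                          labels.view(-1), reduction="none").view(labels.shape)
    noncot = (nll * noncot_mask).sum() / noncot_mask.sum().clamp(min=1)
    cot = (nll * cot_mask).sum() / cot_mask.sum().clamp(min=1)
    return noncot + lambda_cot * cot
```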

For self-distillation, we precompute top-$k$ teacher distributions with $k=128$. The teacher is the same Qwen3-14B model used to initialize training, but queried with a different input format: we concatenate the original question prompt and the distilled solution intuition into the user message, then feed the distilled short CoT as the assistant continuation and record the teacher distribution at each token position in that short CoT. On these valid teacher positions, the student minimizes a temperature-scaled KL divergence, with temperature $T=1.0$ and weight $\lambda_{\mathrm{KL}}=0.25$:

$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{|\mathcal{S}_{\mathrm{KL}}|}\sum_{i\in\mathcal{S}_{\mathrm{KL}}} D_{\mathrm{KL}}\!\left(q_i^{(T)} \,\|\, p_\theta^{(T)}(\cdot \mid x_{<i})\right). \tag{9}$$
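A sketch of the teacher-matching term in Eq. (9), evaluated on the stored top-$k$ teacher support ($k=128$, $T=1.0$ in the paper). Restricting the KL to the stored top-$k$ ids is an approximation forced by precomputation; the tensor names are ours.

```python
import torch
import torch.nn.functional as F

def later_kl_loss(student_logits, teacher_idx, teacher_prob, T=1.0):
    """Eq. (9), approximated on the teacher's precomputed top-k support.
    teacher_idx: (N, k) token ids; teacher_prob: (N, k) temperature-scaled
    teacher probabilities at the valid KL positions."""
    log_p = F.log_softmax(student_logits / T, dim=-1)   # student log-probs
    log_p_topk = log_p.gather(-1, teacher_idx)          # align to the support
    kl = (teacher_prob * ((teacher_prob + 1e-12).log() - log_p_topk)).sum(-1)
    return kl.mean()
```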
Finally, we train the model to terminate latent reasoning at the intended boundary. Let $\mathcal{S}_{\mathrm{lat}}^{\mathrm{int}}$ denote latent interior positions, $\mathcal{S}_{\mathrm{lat}}$ denote all latent positions, $\mathcal{V}_{\mathrm{forbid}}$ be the set of forbidden structural tokens, and $b_i \in \{0,1\}$ indicate whether $i$ is the correct stopping boundary. The raw halt loss is

$$\mathcal{L}_{\mathrm{halt}} = \frac{1}{|\mathcal{S}_{\mathrm{lat}}^{\mathrm{int}}|}\sum_{i\in\mathcal{S}_{\mathrm{lat}}^{\mathrm{int}}}\sum_{v\in\mathcal{V}_{\mathrm{forbid}}}\left[z_{i,v} - z_{i,\max}\right]_+ + \frac{1}{|\mathcal{S}_{\mathrm{lat}}|}\sum_{i\in\mathcal{S}_{\mathrm{lat}}}\mathrm{BCE}\!\left(\sigma(z_{i,\texttt{</latent\_think>}}),\, b_i\right), \tag{10}$$

where $z_{i,v}$ is the logit of token $v$ at position $i$, $z_{i,\max}$ is the largest logit among allowed non-structural tokens, and $\sigma(\cdot)$ is the sigmoid function. The first term penalizes forbidden structural tokens when they become too competitive before the stopping point, while the second term directly trains the model to emit `</latent_think>` exactly at the correct boundary. The raw halt loss is assigned a base weight $\lambda_{\mathrm{halt}}^{(s)} = 0.025$. To reduce interference between stopping supervision and the main language-modeling objective, we further modulate this term with a dynamic gate,

$$\alpha_t = \mathrm{clip}\!\left(\frac{\mathrm{EMA}(\mathcal{L}_{\mathrm{CE}})_t}{\mathcal{L}_{\mathrm{CE},t} + \epsilon},\, 0,\, 1\right), \qquad \mathcal{L}_{\mathrm{halt}}^{\mathrm{eff}} = \alpha_t\,\lambda_{\mathrm{halt}}^{(s)}\,\mathcal{L}_{\mathrm{halt}}, \tag{11}$$

where $\mathrm{EMA}(\mathcal{L}_{\mathrm{CE}})_t$ denotes the exponential moving average of the CE loss up to optimization step $t$, $\mathcal{L}_{\mathrm{CE},t}$ is the current-step CE loss, $\epsilon$ is a small constant for numerical stability, and $\mathrm{clip}(\cdot, 0, 1)$ truncates the gate to the interval $[0,1]$. The effective halt loss $\mathcal{L}_{\mathrm{halt}}^{\mathrm{eff}}$ therefore applies stopping supervision mainly when it does not conflict with learning the token-level prediction objective. The total training loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{halt}}^{\mathrm{eff}}. \tag{12}$$
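The gate and the total objective (Eqs. (11)–(12)) reduce to a few lines; the EMA decay value in the comment is an assumption, since the paper does not report it.

```python
import torch

def gated_total_loss(ce, kl, halt, ema_ce,
                     lambda_kl=0.25, lambda_halt=0.025, eps=1e-8):
    """Eq. (11) gate and Eq. (12) total loss. `ema_ce` is an exponential
    moving average of the CE loss maintained by the training loop."""
    alpha = torch.clamp(ema_ce / (ce.detach() + eps), 0.0, 1.0)
    return ce + lambda_kl * kl + alpha * lambda_halt * halt

# Loop-side EMA update (decay 0.99 is our assumption):
# ema_ce = 0.99 * ema_ce + 0.01 * ce.detach()
```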

### 3.4 Training setup

We use AdamW [12] with learning rate $1.0\times10^{-7}$, minimum cosine learning rate $1.0\times10^{-8}$, weight decay 0.01, and $\beta_1 = 0.9$, $\beta_2 = 0.95$. We enable FlashAttention-2 [4]. Distributed training uses DeepSpeed ZeRO-3 [14] on 8 NVIDIA A800 80GB GPUs on a single node.
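For reference, the stated settings map directly onto a standard AdamW configuration; the cosine decay to the $10^{-8}$ floor is left to whichever scheduler the training framework provides.

```python
from torch.optim import AdamW

# Optimizer settings from Sec. 3.4; the cosine schedule with a 1e-8 floor
# is handled separately by the training framework's scheduler.
optimizer = AdamW(model.parameters(), lr=1.0e-7, betas=(0.9, 0.95),
                  weight_decay=0.01)
```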

For evaluation, we compare four systems on the same benchmarks: the original Qwen3-14B with standard explicit CoT prompting, a CoT SFT model trained on the same distilled data using only the CE and KL objectives, LaTER in the training-free setting, and the fully trained LaTER model. The CoT SFT baseline uses exactly the same training data and optimization strategy as trained LaTER, but removes the latent-reasoning component and therefore does not include $\mathcal{L}_{\mathrm{halt}}^{\mathrm{eff}}$.

Table 3: Comparison between the different baselines across the benchmarks. Green cells mark the best accuracy among all methods for each benchmark, while blue cells mark the lowest token usage.

### 3.5 Main results

Table 3 compares the four Qwen3-14B variants. Trained LaTER achieves the lowest token usage on all seven benchmarks and the best accuracy on most of them. On AIME 2025, it reaches 80.0% accuracy, 10.0 points above the standard CoT baseline, while reducing average token usage by 33%. It also improves GSM8K, ARC-Challenge, GPQA, HumanEval+, and MBPP+ relative to the baseline while using fewer tokens.

**Transcending the Accuracy–Efficiency Trade-Off.** A key comparison is CoT-SFT versus trained LaTER. CoT-SFT benefits from the same distilled data and improves AIME 2025 from 70.0% to 73.3%, but it remains less accurate than trained LaTER and uses more tokens (12,687 versus 10,575 on AIME 2025). This suggests that the gains are not merely a consequence of shorter supervised traces: the latent-first architecture contributes additional efficiency and reasoning accuracy.

**Isolating the Role of Latent Reasoning.** The results are also nuanced. Training-free LaTER remains the strongest method on MATH-500 in Table 3, whereas trained LaTER is more efficient and stronger on most other tasks. This indicates that supervised latent training improves the overall accuracy–efficiency frontier but does not uniformly dominate every benchmark. We view this as evidence that latent-budget allocation and data mixture remain important design choices.

## 4 Related Work

**Training-free latent reasoning.** Soft Thinking and SwiReasoning [25, 17] replace hard token inputs with probability-weighted mixtures of token embeddings, enabling latent reasoning from the model's own next-token distribution. However, soft-embedding methods can collapse toward the dominant token and thus behave similarly to greedy decoding, limiting their ability to maintain alternative reasoning paths. SeLaR [6] addresses this issue with entropy-gated activation, applying latent reasoning only at high-uncertainty steps and preserving discrete decoding at deterministic steps. LatentMAS [28] does not use next-token embedding mixtures. Instead, it projects the previous step's hidden state back into the input embedding space and uses this latent state as the next input. It further shares KV-cache working memory across agents as a training-free communication channel. While effective, these training-free methods still rely on hand-crafted switching or local confidence heuristics and do not explicitly separate exploratory reasoning from rigorous derivation.

**Training-based latent reasoning.** Coconut [8] pioneers autoregressive latent reasoning by feeding the last-layer hidden state back as the next input embedding, showing that continuous thoughts can support implicit breadth-first exploration. However, its reliance on fixed latent steps and direct hidden-state reuse exposes a mismatch between hidden states and token embeddings. Subsequent methods improve this paradigm by learning better latent interfaces. SoftCoT [21] reduces full-model adaptation by using an assistant model to generate soft thoughts and a trainable projection module to align them with the target LLM. More recent methods further refine the definition and training of latent tokens. Latent-SFT [5] constrains latent reasoning to the vocabulary column space and learns latent tokens with KL and CE objectives, whereas CoLaR [18] predicts compressed embedding distributions and applies reinforcement learning to encourage both diverse exploration and compact reasoning. These methods demonstrate that latent reasoning can substantially shorten reasoning chains, but they largely aim to replace explicit CoT with latent computation. As a result, their performance can degrade on complex tasks where precise symbolic verification is essential.

## 5 Conclusion

We introduce LaTER, a latent-then-explicit reasoning paradigm for reducing test-time token cost without discarding explicit verification. The method separates reasoning into two phases: a continuous latent rollout for early exploration, followed by discrete CoT generation for symbolic checking and final-answer construction. In the training-free setting, we find that pretrained reasoning models already exhibit structured latent trajectories, including terminating-token probes and informative entropy dynamics. A simple adaptive switch based on these signals reduces token usage and can improve accuracy on several benchmarks. We then construct Latent-Switch-69K and train a LaTER model with a learned latent projector and halting supervision. The trained model improves the accuracy–efficiency tradeoff across mathematics, coding, and knowledge-intensive benchmarks, reaching 80.0% on AIME 2025 while using one third fewer tokens than standard CoT.

LaTER also leaves open important questions. The training-free switch is still a hand-crafted approximation, and the trained model's behavior depends on the latent-budget distribution and the quality of distilled intuitions. Future work should learn richer instance-adaptive halting policies, study longer latent exploration for open-ended tasks, and extend latent-then-explicit reasoning to multimodal settings where full verbalization can be especially costly.

## References

- [1] SRI Lab, ETH Zurich (2025). MathArena: Evaluating LLMs on uncontaminated math competitions. https://matharena.ai/
- [2] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457.
- [3] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- [4] T. Dao (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691.
- [5] J. Deng, L. Pang, Z. Wei, S. Xu, Z. Duan, K. Xu, Y. Song, H. Shen, and X. Cheng (2025). Latent reasoning in LLMs as a vocabulary-space superposition. arXiv:2510.15522.
- [6] R. Fu and G. Luo (2026). SeLaR: Selective latent reasoning in large language models. arXiv:2604.08299.
- [7] D. Guo, D. Yang, H. Zhang, et al. (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. https://dx.doi.org/10.1038/s41586-025-09422-z
- [8] S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. E. Weston, and Y. Tian (2025). Training large language models to reason in a continuous latent space. In Second Conference on Language Modeling. https://openreview.net/forum?id=Itxz7S4Ip3
- [9] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let's verify step by step. arXiv:2305.20050.
- [10] J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023). Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
- [11] J. Liu, S. Xie, J. Wang, Y. Wei, Y. Ding, and L. Zhang (2024). Evaluating language models for efficient code generation. In First Conference on Language Modeling. https://openreview.net/forum?id=IBCBMeAhmC
- [12] I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7
- [13] T. Olmo, A. Ettinger, A. Bertsch, et al. (2026). Olmo 3. arXiv:2512.13961.
- [14] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020). ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
- [15] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024). GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling. https://openreview.net/forum?id=Ti67584b98
- [16] Z. Shen, H. Yan, L. Zhang, Z. Hu, Y. Du, and Y. He (2025). CODI: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China. https://aclanthology.org/2025.emnlp-main.36/
- [17] D. Shi, A. Asi, K. Li, X. Yuan, L. Pan, W. Lee, and W. Xiao (2026). SwiReasoning: Switch-thinking in latent and explicit for Pareto-superior reasoning LLMs. arXiv:2510.05069.
- [18] W. Tan, J. Li, J. Ju, Z. Luo, R. Song, and J. Luan (2026). Think silently, think fast: Dynamic latent compression of LLM reasoning chains. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=AQsko3PPUe
- [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
- [20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
- [21] Y. Xu, X. Guo, Z. Zeng, and C. Miao (2025). SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria. https://aclanthology.org/2025.acl-long.1137/
- [22] A. Yang, A. Li, B. Yang, B. Zhang, et al. (2025). Qwen3 technical report. arXiv:2505.09388.
- [23] X. Yu, Z. Chen, Y. He, T. Fu, C. Yang, C. Xu, Y. Ma, X. Hu, Z. Cao, J. Xu, G. Zhang, J. Tao, J. Zhang, S. Ma, K. Feng, H. Huang, Y. Li, R. Chen, H. Wang, C. Wu, Z. Su, X. Xu, K. Yao, K. Wang, C. Gao, Y. Liao, R. Huang, T. Jin, C. Tan, J. Zhang, W. Ren, Y. Fu, Y. Liu, Y. Wang, X. Yue, Y. Jiang, and S. Yan (2026). The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv:2604.02029.
- [24] Z. Zhang, X. He, W. Yan, A. Shen, C. Zhao, and X. E. Wang (2026). Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ByQdHPGKgU
- [25] Z. Zhang, X. He, W. Yan, A. Shen, C. Zhao, and X. E. Wang (2026). Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ByQdHPGKgU
- [26] S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026). Self-distilled reasoner: On-policy self-distillation for large language models. arXiv:2601.18734.
- [27] R. Zhu, T. Peng, T. Cheng, X. Qu, J. Huang, D. Zhu, H. Wang, K. Xue, X. Zhang, Y. Shan, T. Cai, T. Kergan, A. Kembay, A. Smith, C. Lin, B. Nguyen, Y. Pan, Y. Chou, Z. Cai, Z. Wu, Y. Zhao, T. Liu, J. Yang, W. Zhou, C. Zheng, C. Li, Y. Zhou, Z. Li, Z. Zhang, J. Liu, G. Zhang, W. Huang, and J. Eshraghian (2025). A survey on latent reasoning. arXiv:2507.06203.
- [28] J. Zou, X. Yang, R. Qiu, G. Li, K. Tieu, P. Lu, K. Shen, H. Tong, Y. Choi, J. He, J. Zou, M. Wang, and L. Yang (2025). Latent collaboration in multi-agent systems. arXiv:2511.20639.

## Appendix A Per-step entropy heterogeneity in training-free latent reasoning

We provide a finer-grained view of the stopping signals used by the training-free version of LaTER. For each AIME 2025 problem, we begin with latent reasoning and continue rolling the hidden state forward until the current hidden state can be decoded to a model-native terminating token, such as `<|im_end|>`. At every latent step, we decode the hidden state into the vocabulary space and record the entropy of the resulting predictive distribution. This produces a trajectory-level entropy trace for the entire latent phase, from the first latent transition to the step immediately preceding termination.

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/latent_token_entropy_boxplot.png)

Figure 8: Per-step entropy distributions during the training-free latent phase on AIME 2025. For each problem, we record the entropy of the decoded vocabulary distribution at every latent step until the hidden state decodes to a terminating token such as `<|im_end|>`. Each boxplot aggregates all problems that are still in the latent phase at that step position. The wide variation in spread and upper tails shows that both the scale and the timing of peak entropy differ substantially across instances, which helps explain why a single entropy threshold cannot be uniformly optimal.

Figure 8 groups the entropy traces by latent step position. Each boxplot shows the entropy values at one step, using all problems that are still in the latent phase at that point. The distributions vary widely across steps and across problems. At many step positions, the spread is wide, the upper tail changes noticeably, and the number of active trajectories also changes as some problems terminate earlier than others. This means that the latent phase does not follow one shared entropy curve.

The main finding is that both the size of the entropy peak and the step at which it appears differ greatly across problems. Some problems show a sharp early peak and then settle quickly. Others stay at relatively low entropy for many steps and peak only near the stopping point. Still others remain broad and unstable until very late in the trajectory, suggesting that the model continues exploring for much longer. In short, the entropy maximum is highly instance-specific in both value and timing.

This is why a single global threshold is only a rough stopping rule. A threshold that works well for trajectories with large early spikes may stop too early on problems that need a longer latent phase. A higher threshold may fit those slower cases better, but then it may wait too long on problems that are already ready to switch. In practice, the decision should depend not only on the entropy at one step, but also on the trend of the trajectory: whether entropy is rising or falling, how long uncertainty lasts, and whether the decoded state is already close to a terminating token.

For this reason, we view the training-free entropy rule as a useful diagnostic rather than a complete solution. It already captures meaningful structure in the latent dynamics and yields strong efficiency gains in the main experiments. However, the appendix results also show that a hand-crafted threshold cannot fully match the diversity of real latent trajectories. A more natural next step is to learn an instance-adaptive switching policy that uses the full trajectory, instead of relying on one fixed scalar cutoff.

## Appendix BConstruction of Latent\-Switch\-69K

This section describes how we build the supervised corpus used to train LaTER\. Each retained example contains four parts: a user problem, a distilled solution intuition, a shortened explicit CoT, and a final answer\. The preprocessing pipeline turns this record into a latent\-supervised fine\-tuning example\. In this format, the model first passes through a bounded latent segment and then returns to ordinary explicit reasoning\. The same pipeline also builds a teacher\-reference conversation\. This allows us to combine token\-level language modeling targets with teacher\-distribution supervision on explicit reasoning and answer tokens\.

Figure[9](https://arxiv.org/html/2605.07315#A2.F9)summarizes the composition of the final corpus\. Consistent with Table[2](https://arxiv.org/html/2605.07315#S3.T2), the final training split contains 69,745 examples\. Most examples are in the medium\-difficulty bucket, which accounts for 65\.5% of the data\. Hard examples account for 25\.0%, and easy examples account for 9\.5%\. At the domain level, the source mixture is dominated by mathematical and coding data: math contributes about 37% of examples and code about 34%, while science\-oriented questions account for roughly 5%\. The remaining examples mainly come from instruction\-following and general knowledge\-oriented prompts, so the retained corpus stays centered on reasoning\-intensive tasks while preserving some diversity in format and topic\. This imbalance is intentional\. Medium\-difficulty problems provide the cleanest signal for learning when to move from latent exploration to explicit verification\. They are hard enough to require real reasoning, but usually not so noisy that distillation becomes unstable\. The hard subset broadens coverage and exposes the model to longer reasoning chains\. The easy subset helps stabilize training and preserves short\-form answer behavior\.

![Refer to caption](https://arxiv.org/html/2605.07315v1/images/training_dataset_composition_pie.png)Figure 9:Composition of the Latent\-Switch\-69K training corpus after filtering and distillation\.#### Distillation pipeline\.

We start from reasoning traces sampled from Dolci-Think-SFT-32B and distill them with a stronger reasoning model. For each source problem, we first ask for a short *solution intuition* that states the high-level plan in a few sentences. We then ask the teacher to produce a shorter explicit CoT conditioned on the original problem and this intuition. The final record therefore keeps both a compressed latent-style summary and an explicit derivation. This design lets us train the model to use continuous latent computation without giving up token-level supervision on the visible reasoning segment.
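A minimal sketch of this two-stage loop is shown below. `teacher_generate` is a hypothetical stand-in for whatever inference API serves the teacher model, and the prompt strings are abbreviations of the full prompts in Appendix D.

```python
def distill_example(problem, source_trace, teacher_generate):
    """Two-stage distillation sketch; `teacher_generate` is hypothetical."""
    # Stage 1: compress the sampled reasoning trace into a short intuition.
    intuition = teacher_generate(
        system="You are an expert reasoning data curator. "
               "Extract only the key insights from the source reasoning.",
        user=f"Problem:\n{problem}\n\nSource outputs:\n{source_trace}",
    )
    # Stage 2: a shortened explicit CoT conditioned on problem + intuition.
    short_cot = teacher_generate(
        system="You are solving a problem from your previous intuition.",
        user=f"Problem:\n{problem}\n\nYour Intuition:\n{intuition}",
    )
    return {"problem": problem, "intuition": intuition, "short_cot": short_cot}
```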

#### Student response template.

For a sample with $m$ latent steps, distilled CoT tokens $t_{1:n}$, and answer tokens $a_{1:r}$, the assistant response is written as

$$\texttt{<latent\_think>}\; l_{1},\ldots,l_{m}\; \texttt{</latent\_think>}\; \texttt{<think>}\; t_{1},\ldots,t_{n}\; \texttt{</think>}\; a_{1},\ldots,a_{r}\; \texttt{<|im\_end|>}.$$

The symbols $l_{1},\ldots,l_{m}$ denote latent placeholder positions. In the current implementation, these positions are filled with a repeated placeholder token. However, they are not trained as ordinary language targets. During the forward pass, their token embeddings are replaced by recurrent latent states produced by the latent projector. The model therefore learns to reason internally across these positions instead of emitting visible text there.
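Rendered as text, the template amounts to the sketch below. The placeholder token name is an assumption for illustration; what matters is that the latent span is filled with a repeated token whose embeddings are later overwritten.

```python
def build_student_response(n_latent_steps, short_cot, answer,
                           placeholder="<latent_pad>"):
    """Render the LaTER response template as a string.

    `placeholder` is illustrative; its embeddings are replaced by projected
    latent states during the forward pass, so the literal token is never
    trained as a language target.
    """
    latent_span = placeholder * n_latent_steps
    return (f"<latent_think>{latent_span}</latent_think>"
            f"<think>{short_cot}</think>{answer}<|im_end|>")
```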

#### Latent budget assignment.

Each sample stores its latent budget as an integer field, which we denote conceptually as n_latent_steps. This value determines how many placeholder positions appear between <latent_think> and </latent_think>. In preprocessing, the budget is derived from the length of the distilled solution intuition: if the retained intuition contains $L$ tokens, the target latent budget is set to about $L/2$ and then clipped by the maximum latent length and other tokenization constraints. This heuristic ties the amount of latent computation to the amount of compressed reasoning content kept by distillation. Across the final corpus, the latent-step count has mean 41.49 and median 40.00. This choice is informed by the training-free experiments: in Section [2.5](https://arxiv.org/html/2605.07315#S2.SS5), we found that the model reaches its best reasoning accuracy and token efficiency after roughly 40-50 latent steps, so we normalized the number of latent steps in the training data to this range.
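The heuristic reduces to a few lines. In the sketch below, the clipping bound of 128 matches the maximum latent budget reported in Appendix C, while the function and argument names are ours.

```python
def latent_budget(num_intuition_tokens, max_latent_steps=128, min_steps=1):
    """Target latent budget: about half the distilled intuition length,
    clipped to the allowed latent range (bounds from Appendix C)."""
    return max(min_steps, min(num_intuition_tokens // 2, max_latent_steps))
```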

#### Supervision masks.

Let $\mathcal{S}_{\mathrm{prompt}}$ denote prompt positions and let $\mathcal{S}_{\mathrm{lat}}^{\mathrm{int}}$ denote latent interior positions, excluding the latent boundary markers. The CE label at token position $i$ is

$$y_{i}=\begin{cases}-100, & i\in\mathcal{S}_{\mathrm{prompt}}\cup\mathcal{S}_{\mathrm{lat}}^{\mathrm{int}},\\ x_{i}, & \text{otherwise}.\end{cases}$$

Prompt tokens and latent interior placeholders are therefore masked from ordinary token-level CE. By contrast, the latent boundary tokens, explicit CoT tokens, answer tokens, and the terminal <|im_end|> token remain supervised. The preprocessing stage also builds dedicated masks for latent boundaries, explicit CoT regions, answer regions, and teacher-KL positions. This ensures that each objective is applied only where it is semantically meaningful.
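In code, this rule is a boolean-index assignment. A minimal PyTorch sketch, assuming boolean masks aligned with the token sequence (mask names are ours):

```python
import torch

IGNORE_INDEX = -100  # excluded from token-level cross-entropy

def build_ce_labels(input_ids, prompt_mask, latent_interior_mask):
    """Mask prompt and latent-interior positions; supervise everything else,
    including latent boundary tags, explicit CoT, answer, and <|im_end|>."""
    labels = input_ids.clone()
    labels[prompt_mask | latent_interior_mask] = IGNORE_INDEX
    return labels
```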

#### Teacher reference construction.

For teacher-distribution supervision, each example also includes a teacher-reference conversation. This reference omits the student's latent placeholder segment and begins directly with the explicit reasoning part,

$$\texttt{<think>}\; t_{1},\ldots,t_{n}\; \texttt{</think>}\; a_{1},\ldots,a_{r}.$$

Operationally, the teacher input pairs the original question with the distilled solution intuition; the shortened explicit CoT and final answer are then treated as the continuation to be matched. This provides teacher hidden states and teacher logits that align with explicit reasoning and answer positions, without requiring the teacher to model the student's continuous latent placeholders. Teacher KL is applied only on positions selected by the teacher-KL mask.
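A sketch of how such a teacher reference might be assembled; the message layout is an assumption for illustration, not the released preprocessing code.

```python
def build_teacher_reference(problem, intuition, short_cot, answer):
    """Teacher-side view: question plus distilled intuition as input,
    shortened CoT and answer as the continuation to be matched."""
    teacher_prompt = f"{problem}\n\nYour Intuition:\n{intuition}"
    continuation = f"<think>{short_cot}</think>{answer}"
    return teacher_prompt, continuation
```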

Finally, the filtered corpus preserves the compression effect that motivates LaTER. As reported in Table [2](https://arxiv.org/html/2605.07315#S3.T2), the distilled CoT compression ratio has mean 0.612 and median 0.569: the visible reasoning trace is typically only about 57-61% as long as the original one. The dataset does not merely shorten responses; it explicitly separates condensed latent planning from explicit symbolic verification. That structural decomposition is what makes the later latent-reasoning fine-tuning objective well-posed.

## Appendix C Training Details

We initialize LaTER from a Qwen3-14B backbone and optimize the model end to end so that it can interleave a continuous latent reasoning segment with an explicit textual reasoning segment. Each assistant response is formatted as

$$\texttt{<latent\_think>}\; l_{1},\ldots,l_{m}\; \texttt{</latent\_think>}\; \texttt{<think>}\; t_{1},\ldots,t_{n}\; \texttt{</think>}\; a\; \texttt{<|im\_end|>}.$$

Here $l_{1:m}$ denote latent placeholder positions, $t_{1:n}$ denote explicit CoT tokens, and $a$ denotes the final answer. During training, the latent placeholders are not treated as ordinary language targets. Instead, their token embeddings are replaced by recurrent latent states produced by a learned latent projector.

#### Latent forward pass.

For each example, the model first processes the prompt and latent prefix with a cache-based recurrent rollout. At latent step $t$, the previous hidden state $h_{t-1}$ is mapped by the latent projector to the next latent embedding,

$$e^{\mathrm{lat}}_{t}=g_{\phi}(h_{t-1}).$$

These projected embeddings are then written back into the full input embedding sequence. The model subsequently performs a causal teacher-forcing forward pass over the entire sequence, using ordinary token embeddings outside the latent interior and projected latent embeddings inside the latent segment. During batch construction, we align both the assistant prefix and the transition into <think> across examples, which keeps the latent-to-text boundary synchronized during distributed training.
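For concreteness, a minimal sketch of the recurrent rollout, assuming a HuggingFace-style causal LM interface (`past_key_values`, `output_hidden_states`, `inputs_embeds`); variable names are illustrative.

```python
import torch

def latent_rollout(model, projector, prompt_ids, m_steps):
    """Cache-based latent rollout: feed g_phi(h_{t-1}) back as the next
    input embedding while carrying the KV cache forward."""
    out = model(input_ids=prompt_ids, use_cache=True,
                output_hidden_states=True)
    past = out.past_key_values
    h = out.hidden_states[-1][:, -1]   # final-layer state, last position
    latent_embeds = []
    for _ in range(m_steps):
        e_lat = projector(h)           # e_t^lat = g_phi(h_{t-1})
        latent_embeds.append(e_lat)
        out = model(inputs_embeds=e_lat.unsqueeze(1),
                    past_key_values=past, use_cache=True,
                    output_hidden_states=True)
        past = out.past_key_values
        h = out.hidden_states[-1][:, -1]
    # the stacked latent embeddings are written back into the full sequence
    return torch.stack(latent_embeds, dim=1), past
```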

#### Supervised objectives.

Prompt tokens and latent interior placeholders are masked from token-level cross-entropy. The latent boundary tokens, explicit reasoning tokens, answer tokens, and the final <|im_end|> token remain supervised. We decompose the CE objective into an explicit-CoT term and a complementary non-CoT term,

$$\mathcal{L}_{\mathrm{CE}}=\mathcal{L}_{\mathrm{nonCoT}}+\lambda_{\mathrm{CoT}}\,\mathcal{L}_{\mathrm{CoT}}.$$

In the current configuration, $\lambda_{\mathrm{CoT}}=0.5$, so explicit reasoning tokens and the remaining supervised tokens contribute at a similar scale. The structural tags <think> and </think> are assigned to $\mathcal{L}_{\mathrm{nonCoT}}$, whereas $\mathcal{L}_{\mathrm{CoT}}$ contains only the interior reasoning tokens.
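The decomposition can be computed from per-token CE and a CoT-interior mask. A sketch assuming logits of shape [B, T, V] and labels already masked with -100 (mask names are ours):

```python
import torch.nn.functional as F

def ce_decomposed(logits, labels, cot_interior_mask, lambda_cot=0.5):
    """L_CE = L_nonCoT + lambda_CoT * L_CoT (illustrative implementation)."""
    per_tok = F.cross_entropy(logits.transpose(1, 2), labels,
                              ignore_index=-100, reduction="none")
    valid = labels != -100
    cot = valid & cot_interior_mask        # interior reasoning tokens only
    non_cot = valid & ~cot_interior_mask   # incl. <think>, </think>, answer
    loss_cot = per_tok[cot].mean() if cot.any() else per_tok.new_zeros(())
    loss_non = per_tok[non_cot].mean() if non_cot.any() else per_tok.new_zeros(())
    return loss_non + lambda_cot * loss_cot
```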

#### Teacher distribution matching.

In addition to CE, we apply cached teacher-distribution supervision on selected explicit reasoning and answer positions. Let $q_{i}$ denote the cached top-$k$ teacher distribution and let $p_{\theta}(\cdot\mid i)$ denote the student distribution at the aligned source position. The KL objective is

$$\mathcal{L}_{\mathrm{KL}}=\frac{1}{|\mathcal{S}_{\mathrm{KL}}|}\sum_{i\in\mathcal{S}_{\mathrm{KL}}}D_{\mathrm{KL}}\!\left(q_{i}\,\|\,p_{\theta}(\cdot\mid i)\right).$$

The current configuration uses temperature $1.0$ and KL weight $0.25$. This objective distills the teacher's explicit reasoning and answer behavior without requiring the teacher to model the student's continuous latent placeholders.
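A sketch of this objective restricted to the cached top-$k$ teacher support; tensor names and shapes are assumptions (logits [B, T, V], teacher ids/probs [B, T, K], boolean mask [B, T]).

```python
import torch
import torch.nn.functional as F

def topk_kl_loss(student_logits, teacher_topk_ids, teacher_topk_probs,
                 kl_mask, temperature=1.0):
    """KL(q || p_theta) on cached top-k teacher support,
    averaged over teacher-KL positions."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_topk = torch.gather(log_p, dim=-1, index=teacher_topk_ids)
    log_q = teacher_topk_probs.clamp_min(1e-8).log()
    kl = (teacher_topk_probs * (log_q - log_p_topk)).sum(-1)  # per position
    return kl[kl_mask].mean()
```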

Importantly, we do not apply CE or KL supervision directly inside the latent reasoning segment. The latent placeholder positions are not trained to match token targets or teacher distributions. Instead, the latent segment is optimized only indirectly: gradients from the downstream explicit CoT and answer tokens are back-propagated through the latent rollout. In this way, the model learns latent reasoning states only to the extent that they help the later explicit reasoning and final answer.

#### Halting supervision.

We further train the latent segment to terminate at the correct boundary with a dense auxiliary halting loss over latent interior positions. This loss compares the logit of </latent_think> and other forbidden structural tokens against the best allowed non-structural token, while also applying a BCE loss that pushes the halting decision toward the correct latent boundary. The halting weight is annealed to a small final value of $0.025$ and gated by the CE quality signal so that the stopping objective does not dominate the language-modeling objective.
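The text above pins down the signals but not the exact functional form. The sketch below is one plausible reading, with the margin construction and tensor names as our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def halting_loss(logits, close_tag_id, allowed_token_mask,
                 boundary_targets, weight=0.025):
    """Dense halting sketch over latent interior positions (assumed form).

    `boundary_targets` is 1.0 where </latent_think> should fire and 0.0
    elsewhere; the closing tag's logit is compared against the best allowed
    non-structural token and trained with BCE on that margin.
    """
    close_logit = logits[..., close_tag_id]
    best_allowed = logits.masked_fill(~allowed_token_mask,
                                      float("-inf")).amax(dim=-1)
    margin = close_logit - best_allowed
    return weight * F.binary_cross_entropy_with_logits(
        margin, boundary_targets.float())
```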

#### Optimization setup.

The current training run uses bf16 training, FlashAttention-2, gradient checkpointing, DeepSpeed ZeRO-3, micro-batch size 1, and gradient accumulation 4. We set the maximum sequence length to 24,096 and allow latent budgets of up to 128 steps.

#### Compute resources.

All training runs are conducted on NVIDIA A800 80GB GPUs. The trained LaTER model is optimized on a single node with 8 GPUs and requires approximately 5 days for the main training run under the configuration above. The CoT-SFT baseline is trained on the same hardware setup and requires approximately 2 days. All evaluation tasks are run on 2 A800 80GB GPUs. These numbers are intended to give a practical estimate of the wall-clock compute required to reproduce the reported training and evaluation pipeline.

#### Evaluation protocol.

Unless otherwise specified, each reported evaluation result is obtained by running the corresponding model-setting pair multiple times under the same decoding configuration with a fixed random seed.

## Appendix D Prompts

Prompt

Target Question: {question}

You are a helpful assistant.

You must reason step-by-step to solve the provided Target **Question** without outputting other irrelevant information.

Now, reason step by step and output the final answer inside \boxed{YOUR_FINAL_ANSWER}.

Figure 10: Prompt for math questions.

Prompt

Target Question: {question}

You are a helpful assistant.

You must reason step-by-step to solve the provided **Target Question** without outputting other irrelevant information.

Your final answer must be selected from A, B, C, D, ... For example \boxed{A}. Do not add any other contents inside the box.

Now, reason step by step and output the final answer inside \boxed{YOUR_FINAL_ANSWER}.

Figure 11: Prompt for multi-choice questions.

Prompt

Target Question: {question}

You must put all python code as self-contained Python function(s) in markdown code blocks. For example:

```python
import math

def add(a, b):
    return a + b
```

Do not add any other contents inside the markdown code block.

Now, reason step by step and output the final answer:

Figure 12: Prompt for code problems.

Prompt

You are an expert reasoning data curator. Extract only the key insights from the source reasoning.

Rules:
- Return valid JSON only.
- Do not produce a short chain of thought.
- Do not provide any final answers in your response.
- correct_insight must be the coarse but correct high-level solution idea.
- incorrect_insights must only include wrong ideas that are explicitly evidenced in the source outputs.
- Do not invent errors that are not present in the source outputs.

Figure 13: System prompt to generate solution intuition.

Prompt

You will receive a problem, one or more source outputs, and optional ground truth. Return JSON with this schema:

{
  "task_summary": "short task type description",
  "correct_insight": "2-10 sentences of high-level correct plan",
  "incorrect_insights": [
    {
      "idea": "2-10 sentences of a wrong high-level plan evidenced in the source outputs",
      "why_wrong": "brief reason it fails"
    }
  ],
  "source_answer_correct": true,
  "contains_reflection": true
}

REMEMBER: Only extract insights that reflect on the high-level idea. **Do not** give any actual answer in correct_insight or incorrect_insights.

Only give your complete plan to solve the question. Do not directly state whether it is right or wrong. Do not add content unrelated to the idea in correct_insight or incorrect_insights.

Problem prompt:

<<<<\\\!\\\!<\\\!\\\!<PROMPT\>\>\>\>\\\!\\\!\>\\\!\\\!\>
\{prompt\}
<<<<\\\!\\\!<\\\!\\\!<END\_PROMPT\>\>\>\>\\\!\\\!\>\\\!\\\!\>Source outputs:

<<<<\\\!\\\!<\\\!\\\!<OUTPUTS\>\>\>\>\\\!\\\!\>\\\!\\\!\>
\{outputs\}
<<<<\\\!\\\!<\\\!\\\!<END\_OUTPUTS\>\>\>\>\\\!\\\!\>\\\!\\\!\>Ground truth:

<<<<\\\!\\\!<\\\!\\\!<GROUND\_TRUTH\>\>\>\>\\\!\\\!\>\\\!\\\!\>
\{ground\_truth\}
<<<<\\\!\\\!<\\\!\\\!<END\_GROUND\_TRUTH\>\>\>\>\\\!\\\!\>\\\!\\\!\>Metadata: \{metadata\_json\}

Figure 14: User prompt to generate solution intuition.

Prompt

You are solving a problem from your previous intuition.

Rules:
- Continue from your intuition instead of restarting from scratch.
- Your intuition may be correct or incorrect.
- Follow your intuition and finish the reasoning efficiently.
- Keep private reasoning compact and avoid repeated planning.
- In the visible answer, provide the final answer and a brief justification only.

Figure 15: System prompt to generate shorter CoT.

Prompt

Solve the following problem by continuing from your intuition. Your intuition may be correct or incorrect. Do not ignore it. Continue from it and finish the solution.

REMEMBER: The intuition comes from your previous conversation with yourself. It is not the user's intuition.

Problem prompt:

<<<PROMPT>>>
{prompt}
<<<END_PROMPT>>>

Your Intuition:

<<<INTUITION>>>
{insight_text}
<<<END_INTUITION>>>

Figure 16: User prompt to generate shorter CoT.
## Appendix E Failure Case: Confident but Misleading Latent Reasoning

Figure [17](https://arxiv.org/html/2605.07315#A5.F17) presents a representative failure case from AIME 2025 Problem 10. The problem asks the model to count valid fillings of a $3\times 9$ Sudoku-band grid, where each row and each $3\times 3$ block contains the digits $1,\ldots,9$ exactly once. The correct answer is $81$.

This example is notable because the latent trajectory does not appear uncertain under the entropy diagnostic. During latent reasoning, entropy remains low, with a maximum of only $1.625$; the highest entropy value in the CoT stage is $3.23$. The latent reasoning ends after the trajectory hits <|endoftext|>. The subsequent explicit reasoning phase follows an incorrect counting path and returns $66$.

In contrast, the same backbone under standard explicit CoT decoding solves the problem correctly and returns $81$. The explicit CoT trace preserves the intermediate combinatorial bookkeeping: it fixes the first row, counts the possible second-row block assignments, accounts for the independent within-block permutations in the third row, and obtains $9!\cdot 56\cdot 6^{6}=2^{16}3^{10}5^{1}7^{2}$. The LaTER trace enters the explicit phase after a low-entropy latent segment and reconstructs a similar high-level counting setup, but then treats the final block as determined once the first two blocks are chosen. This omits an additional $6^{3}$ ordering factor and yields $9!\cdot 56\cdot 6^{3}=2^{13}3^{7}5^{1}7^{2}$, producing the wrong answer $66$.
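Both factorizations are easy to verify mechanically; a quick check with sympy:

```python
from math import factorial
from sympy import factorint

n_cot = factorial(9) * 56 * 6**6     # baseline CoT count
n_latent = factorial(9) * 56 * 6**3  # latent run, missing one 6^3 factor

print(factorint(n_cot))     # {2: 16, 3: 10, 5: 1, 7: 2} -> 2*16+3*10+5*1+7*2 = 81
print(factorint(n_latent))  # {2: 13, 3: 7, 5: 1, 7: 2}  -> 2*13+3*7+5*1+7*2 = 66
```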

This case illustrates a limitation of entropy as a standalone confidence signal for latent reasoning. Low entropy can indicate that the model is locally confident, but it does not guarantee that the latent state encodes the correct global combinatorial structure. Here, latent reasoning appears to compress the search into a confident but flawed state, while explicit CoT preserves enough intermediate structure to recover the missing factor.

Failure Case: AIME 2025 Problem 10

Task. The 27 cells of a $3\times 9$ grid are filled in using the numbers 1 through 9 so that each row contains 9 different numbers, and each of the three $3\times 3$ blocks heavily outlined in the example below contains 9 different numbers, as in the first three rows of a Sudoku puzzle.

| 4 | 2 | 8 | 9 | 6 | 3 | 1 | 7 | 5 |
| 3 | 7 | 9 | 5 | 2 | 1 | 6 | 8 | 4 |
| 5 | 6 | 1 | 8 | 4 | 7 | 9 | 2 | 3 |

The number of different ways to fill such a grid can be written as $p^{a}\cdot q^{b}\cdot r^{c}\cdot s^{d}$, where $p,q,r,$ and $s$ are distinct prime numbers and $a,b,c,$ and $d$ are positive integers. Find $p\cdot a+q\cdot b+r\cdot c+s\cdot d$.

Baseline CoT. The standard explicit CoT run explicitly enumerates the combinatorial factors. It first observes that the first row can be arbitrary ($9!$ choices). Given the first row, it derives $56$ possible set-level assignments for the second row and $6^{3}$ within-block orderings. It then notes that the third row still has independent within-block orderings, contributing another $6^{3}$ factor:

$$N_{\mathrm{CoT}}=9!\cdot 56\cdot 6^{6}=2^{16}3^{10}5^{1}7^{2}.$$

Therefore, $2\cdot 16+3\cdot 10+5\cdot 1+7\cdot 2=81$. The final prediction is correct: $\mathrm{Pred}=81$, $\mathrm{Gold}=81$. The important qualitative point is that the explicit trace keeps the row/block assignment and ordering factors separate until the final factorization.

Latent reasoning. The output is organized as a latent segment followed by an explicit CoT segment. Abridged from the raw trace:

$$\underbrace{\cdots}_{\substack{\text{low-entropy latent tokens}\\ H_{\max}=1.625}}\ \texttt{<|endoftext|>}\ \texttt{<think>}$$

"Given Block A, the number of valid Block B's is $56\times 216$ ... Block C is uniquely determined ... Therefore, the total number of grids is $9!\times 56\times 216$ ... Hence, the final answer is $66$."

Thus the explicit CoT after the latent segment follows a superficially similar counting strategy, but it treats the third block as uniquely determined once the first two blocks are chosen. This makes the trace use $N_{\mathrm{latent}}=9!\cdot 56\cdot 6^{3}=2^{13}3^{7}5^{1}7^{2}$ instead of $9!\cdot 56\cdot 6^{6}$. This gives $2\cdot 13+3\cdot 7+5\cdot 1+7\cdot 2=66$. The final prediction is incorrect: $\mathrm{Pred}=66$, $\mathrm{Gold}=81$.

Figure 17: Low-entropy latent reasoning can still fail. On AIME 2025 Problem 10, the baseline explicit CoT run answers correctly, whereas the LaTER run remains low-entropy during latent reasoning but converges to an incorrect answer. This suggests that entropy is not sufficient to certify correctness of compressed latent reasoning states.
## Appendix F Licenses for Existing Assets

The table below summarizes the external models, datasets, and software explicitly used in this paper. For each asset, we list the original owner, how it is used, and the released license or additional usage condition stated by the official model card, dataset card, or repository. We use these assets for research and evaluation, cite their original sources in the references, and do not redistribute third-party assets outside their original release terms.

Table 4: Existing external assets used in this paper, with their original owners and released license or usage terms.

| Asset | Type | Use in this paper | Original owner / credit | License |
| --- | --- | --- | --- | --- |
| Qwen3-14B | Model | Main backbone for training-free experiments and the initialized base model for trained LaTER | Qwen Team; Yang et al. [22] | Apache-2.0 |
| DeepSeek-R1-Distill-Llama-8B | Model | Training-free comparison model | DeepSeek-AI; Guo et al. [7] | MIT |
| OLMo-3-32B-Think | Model | Training-free comparison model | AI2 / OLMo Team; Olmo et al. [13] | Apache-2.0 |
| AIME 2025 via MathArena | Benchmark | Main math evaluation benchmark | MathArena / ETH SRI Lab; Balunović et al. [1] | MIT |
| MATH-500 | Benchmark | Math evaluation benchmark | OpenAI / HuggingFaceH4; Lightman et al. [9] | MIT |
| Dolci-Think-SFT-32B | Dataset | Source dataset used to sample reasoning traces for constructing Latent-Switch-69K | AI2 / OLMo Team; Olmo et al. [13] | ODC-BY |
| GSM8K | Dataset | Math word-problem benchmark | OpenAI; Cobbe et al. [3] | MIT |
| GPQA | Dataset | Science QA benchmark | Rein et al. [15] | MIT |
| ARC-Challenge | Dataset | Reasoning benchmark | AI2; Clark et al. [2] | MIT |
| HumanEval+ | Dataset | Code-generation benchmark | EvalPlus; Liu et al. [10] | Apache-2.0 |
| MBPP+ | Dataset | Code-generation benchmark | EvalPlus; Liu et al. [10, 11] | Apache-2.0 |
| FlashAttention-2 | Software | Attention kernel used in training implementation | Dao-AILab; Dao [4] | BSD-3-Clause |
| DeepSpeed (ZeRO-3) | Software | Distributed training runtime | Microsoft / DeepSpeed Team; Rajbhandari et al. [14] | Apache-2.0 |
