Tag
This survey provides a comprehensive overview of latent reasoning in LLMs, exploring methods that perform multi-step inference in continuous hidden states without explicit token-level supervision.
A Zhihu contributor predicted half a year ago that the next Transformer would absorb loops, recurrent state, sparse routing, and latent reasoning; that prediction is gaining relevance as Loop Engineering advances. The article explores how future Transformer architectures may evolve into hybrid models that blend linear-complexity layers for background context with attention for precise reasoning, along with finer-grained sparsity and native System 2 reasoning.
IV-CoT decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation, using training-only sketch supervision to guide structural queries. It achieves state-of-the-art results on GenEval and T2I-CompBench.
The paper reveals that latent reasoning in transformer-based reasoning models (TRMs) functions as a policy improvement operator, and proposes an algorithm that enhances learning and inference efficiency by up to 18×.
SuperThoughts compresses consecutive chain-of-thought tokens into latent representations and decodes two tokens per step, achieving ~20–30% CoT length reduction with minimal accuracy loss on math reasoning benchmarks, while doubling inference throughput.
This paper analyzes latent reasoning models (LRMs) and demonstrates that observable patterns in latent states are not causal explanations of reasoning; it advocates for matched controls and causal tests in interpretability research.
SWITCH is a switchable latent reasoning framework that uses explicit boundary tokens to enable trainable and interpretable recurrent hidden-state reasoning via on-policy reinforcement learning, outperforming prior approaches.
This paper identifies a 'concept bottleneck' in the CoCoNuT latent reasoning paradigm where hidden states are overwritten across passes, and proposes AGCLR, which adds a gated persistent memory stream to retain intermediate facts. Evaluations on GSM8K, HotpotQA, and ProsQA using GPT-2 show consistent improvements, especially on multi-hop tasks.
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.
MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations, incorporating a generative world model perspective to predict future screen states before acting. On AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.
This paper proposes NF-CoT, a latent reasoning framework that uses normalizing flows to model continuous thoughts in LLMs, preserving the advantages of autoregressive generation and achieving better code generation performance at lower cost.
This paper introduces Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework for LLM agents that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought for harder decisions, achieving up to 84.6% token reduction while maintaining task accuracy.
LaSR proposes a latent reasoning training paradigm for context-aware speech recognition, aligning chain-of-thought supervision around acoustic features to improve terminology recognition without added latency, outperforming standard fine-tuning on Fun-Audio-Chat.
Geometric Latent Reasoning (GLR) introduces a geometric path-approximation method for latent reasoning in LLMs, enabling shorter generations while maintaining accuracy across mathematical reasoning benchmarks.
This paper introduces Semantic Step Prediction, which applies geometric regularization at reasoning step boundaries rather than random token positions, achieving 168× better multi-step latent forecasting on ProcessBench compared to frozen baselines.
CoLaGuard is a new guardrail model that transfers multi-step safety reasoning into a continuous latent space, achieving a 12.9× speedup and a 22.4× token reduction compared to explicit reasoning baselines while matching their macro-F1 performance on ten safety benchmarks.
This paper investigates whether multimodal large language models (MLLMs) can use Miller indices as a latent representation to reason about crystallographic fracture geometry from visual inputs. It evaluates their ability to infer physically valid plane hypotheses and to determine when such a representation is applicable across materials such as ceramics, glass, metals, and concrete.
The paper introduces TTE-Flash, a method that replaces explicit chain-of-thought reasoning with latent think tokens to generate reasoning-aware multimodal representations at constant inference cost, outperforming explicit CoT baselines on the MMEB-v2 benchmark.
LaMR introduces a structured pruning framework for coding agents that decomposes code relevance into semantic evidence and dependency support dimensions, using dedicated CRFs and a mixture-of-experts gate to reduce token usage by up to 31% while maintaining or improving task performance.
This research paper proposes a finite-answer theory to analyze when language models commit to an answer before verbalizing it. Using Qwen3-4B-Instruct, the authors demonstrate that the model's answer preference stabilizes well before the final output is generated, offering insights into latent reasoning and model internal states.