This paper identifies a 'concept bottleneck' in the CoCoNuT latent reasoning paradigm, in which hidden states are overwritten across passes, and proposes AGCLR, which adds a gated persistent memory stream to retain intermediate facts. Evaluations on GSM8K, HotpotQA, and ProsQA using GPT-2 show consistent improvements, especially on multi-hop tasks.
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.
MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations, incorporating a generative world model perspective to predict future screen states before acting. On AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.
This paper proposes NF-CoT, a latent reasoning framework that uses normalizing flows to model continuous thoughts in LLMs, preserving the advantages of autoregressive generation and achieving better code generation performance at lower cost.
This paper introduces Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework for LLM agents that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought for harder decisions, achieving up to 84.6% token reduction while maintaining task accuracy.
LaSR proposes a latent reasoning training paradigm for context-aware speech recognition, aligning chain-of-thought supervision around acoustic features to improve terminology recognition without added latency, outperforming standard fine-tuning on Fun-Audio-Chat.
Geometric Latent Reasoning (GLR) introduces a geometric path-approximation method for latent reasoning in LLMs, enabling shorter generations while maintaining accuracy across mathematical reasoning benchmarks.
This paper introduces Semantic Step Prediction, which applies geometric regularization at reasoning step boundaries rather than random token positions, achieving 168× better multi-step latent forecasting on ProcessBench compared to frozen baselines.
CoLaGuard is a new guardrail model that transfers multi-step safety reasoning into a continuous latent space, achieving a 12.9× speedup and a 22.4× token reduction compared to explicit reasoning baselines while matching their macro-F1 performance on ten safety benchmarks.
This paper investigates whether multimodal large language models (MLLMs) can leverage Miller indices as a latent representation to reason about crystallographic fracture geometry from visual inputs, evaluating their ability to infer physically valid plane hypotheses and determine when such representation is applicable across materials like ceramics, glass, metals, and concrete.
The paper introduces TTE-Flash, a method that replaces explicit chain-of-thought reasoning with latent think tokens to generate reasoning-aware multimodal representations at constant inference cost, outperforming explicit CoT baselines on the MMEB-v2 benchmark.
LaMR introduces a structured pruning framework for coding agents that decomposes code relevance into semantic evidence and dependency support dimensions, using dedicated CRFs and a mixture-of-experts gate to reduce token usage by up to 31% while maintaining or improving task performance.
This research paper proposes a finite-answer theory to analyze when language models commit to an answer before verbalizing it. Using Qwen3-4B-Instruct, the authors demonstrate that answer preference stabilizes well before the final output is generated, offering insights into latent reasoning and the model's internal states.
LatentRAG is a novel framework that shifts reasoning and retrieval for agentic RAG into continuous latent space, reducing inference latency by approximately 90% while maintaining performance comparable to explicit methods.
This paper investigates multilingual latent reasoning in large reasoning models across 11 languages, revealing that while latent reasoning capabilities exist, they are unevenly distributed: stronger in resource-rich languages and weaker in low-resource ones. The study finds that, despite surface-level differences, the internal reasoning mechanisms are largely aligned with an English-centered pathway.
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.