Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

arXiv cs.CL Papers

Summary

This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.

arXiv:2605.07106v1 Announce Type: new

Cached at: 05/11/26, 06:47 AM

# Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
Source: [https://arxiv.org/html/2605.07106](https://arxiv.org/html/2605.07106)
Jin Cui\*¹, Xinyue Long\*², Xunyong Zhang¹, Yadong Zhang¹, Chuanchang Su¹, Jingye Gan¹, Boran Zhao†², Pengju Ren¹

¹State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

²School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

andycui@stu.xjtu.edu.cn, {boranzhao, pengjuren}@xjtu.edu.cn

###### Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V\*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.

\* Equal contribution. † Corresponding author.

## 1 Introduction

Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, largely due to Chain-of-Thought (CoT) reasoning \[[23](https://arxiv.org/html/2605.07106#bib.bib1),[8](https://arxiv.org/html/2605.07106#bib.bib2)\]. However, these models still treat visual information as static preconditions, converting continuous visual features into discrete textual tokens and reasoning only within the textual domain \[[10](https://arxiv.org/html/2605.07106#bib.bib4)\]. This creates an inherent bottleneck: fine-grained visual evidence must be compressed into language tokens before it can participate in reasoning. Recent “thinking with images” \[[16](https://arxiv.org/html/2605.07106#bib.bib14)\] methods alleviate this issue by injecting visual evidence through external tools \[[26](https://arxiv.org/html/2605.07106#bib.bib5),[12](https://arxiv.org/html/2605.07106#bib.bib6),[22](https://arxiv.org/html/2605.07106#bib.bib7)\] or programmatic operations \[[4](https://arxiv.org/html/2605.07106#bib.bib8),[17](https://arxiv.org/html/2605.07106#bib.bib9),[6](https://arxiv.org/html/2605.07106#bib.bib10)\], but their flexibility is limited by predefined tool interfaces and external execution. This necessitates a more unified solution that moves intermediate visual reasoning inside the model, allowing it to manipulate question-relevant visual evidence directly in continuous hidden representations.

Latent visual reasoning offers a promising path toward this goal. Unlike text-based CoT, latent states provide an expressive workspace where visual patterns and abstract concepts can be represented without being discretized into language \[[16](https://arxiv.org/html/2605.07106#bib.bib14),[27](https://arxiv.org/html/2605.07106#bib.bib15)\]. Yet this freedom also introduces a fundamental tension. Since the model’s reasoning behavior and decoding interface are largely shaped by language pretraining, effective latent visual reasoning must not only exploit the expressive capacity of a latent visual manifold $\mathcal{M}_{vis}$, but also remain compatible with the vocabulary-aligned manifold $\mathcal{M}_{vocab}$ where pretrained reasoning circuits and language-grounded decoding are organized. Existing methods such as LVR \[[9](https://arxiv.org/html/2605.07106#bib.bib18)\] and Monet \[[19](https://arxiv.org/html/2605.07106#bib.bib19)\] take important steps by reconstructing visual tokens from latent states or generating continuous embeddings as intermediate visual thoughts, but they do not fully resolve this compatibility problem.

In this work, we first analyze why existing latent visual reasoning methods remain ineffective despite forming distinct latent visual representations. A recent causal mediation study \[[11](https://arxiv.org/html/2605.07106#bib.bib21)\] reveals pronounced *Input–Latent* and *Latent–Answer* disconnects, where latent tokens are weakly grounded in visual inputs and exert limited influence on final predictions. Our empirical analysis further shows that these failures are closely tied to manifold divergence. Specifically, (1) weakly supervised hidden states may drift away from the pretrained vocabulary-aligned manifold and tend to collapse into highly similar, instance-agnostic trajectories; (2) answer tokens usually bypass latent tokens and rely directly on the input image and question; (3) the model must converge abruptly from high-entropy latent visual states to low-entropy answer tokens, which can induce representation mismatch during language decoding.

To address these challenges, we propose RIS (Retrieve, Integrate, and Synthesize), a grounded latent visual reasoning framework that develops latent space as a compatible extension of pretrained reasoning circuits rather than a detached visual manifold. To support training, we first construct a step-wise grounded visual reasoning dataset with 96k samples in which each reasoning step is paired with bounding-box spatial supervision and a region-specific semantic description. Built on this spatial-semantic supervision, RIS structures latent tokens as directed visual evidence retrieval states: bounding-box supervision anchors *where to look*, semantic alignment specifies *what is seen*, and a progressive attention mask forces task-relevant evidence to flow through latent tokens instead of being bypassed during answer generation. Slots beyond the annotated reasoning steps are optimized solely through the final-answer objective, endowing them with the emergent ability to integrate and synthesize evidence retrieved by grounded slots. Finally, we show that generating a slightly elaborated answer between the latent reasoning and the final option-level answer acts as a sequence of manifold transition tokens: it gradually reduces the entropy of the reasoning path from latent states to low-entropy answer tokens instead of forcing an abrupt drop, while also providing dense supervision during training.

We evaluate RIS on five challenging visual reasoning benchmarks. RIS consistently outperforms strong baselines, with particularly clear gains on tasks requiring localization, structured visual search, and multi-step perceptual reasoning. Further analyses show that RIS produces more diverse, interpretable, and task-dependent latent trajectories. Our contributions are summarized as follows:

- We provide a systematic analysis of latent visual reasoning in MLLMs, identifying the interaction between the vocabulary-aligned manifold $\mathcal{M}_{vocab}$ and the latent visual manifold $\mathcal{M}_{vis}$, and revealing manifold divergence, latent trajectory collapse, and answer bypassing as key obstacles.
- We construct a 96k-sample Grounded Latent Supervision Dataset (GLSD) and propose RIS, a spatial-semantic grounded latent reasoning framework that structures latent tokens to retrieve task-relevant visual evidence while developing latent space as a compatible extension of pretrained reasoning circuits rather than a detached visual manifold.
- We demonstrate consistent improvements across visual reasoning benchmarks, especially on localization and multi-step visual reasoning tasks, and further show that RIS learns diverse, interpretable, and progressively integrated latent reasoning trajectories with state-of-the-art performance.

## 2 Related Work

From Static Perception to Internal Visual Imagination. Most current MLLMs adopt text-space CoT reasoning to solve complex visual tasks, treating visual inputs as static premises for language-based inference \[[28](https://arxiv.org/html/2605.07106#bib.bib24),[21](https://arxiv.org/html/2605.07106#bib.bib25)\]. Although effective, such methods reason through discrete text tokens, which provide an indirect and lossy representation for fine-grained visual understanding. Recent *Thinking with Images* \[[16](https://arxiv.org/html/2605.07106#bib.bib14)\] methods alleviate this limitation by using external visual tools to manipulate and inject intermediate visual evidence \[[26](https://arxiv.org/html/2605.07106#bib.bib5)\]. However, their effectiveness is constrained by the availability, design, and granularity of predefined tools. This motivates internal visual reasoning, where models reason over visual evidence in continuous latent states rather than translating it into text or pixels.

Latent Reasoning. Recent studies have explored continuous latent spaces as an alternative to discrete token-level reasoning. Representative approaches include utilizing recursive hidden states for breadth-first search \[[5](https://arxiv.org/html/2605.07106#bib.bib27)\], self-distillation of explicit reasoning traces \[[15](https://arxiv.org/html/2605.07106#bib.bib26)\], and implicit reasoning via superposed latent chains \[[2](https://arxiv.org/html/2605.07106#bib.bib16)\]. While these methods enhance reasoning efficiency, they remain confined to the textual space. Extending them to MLLMs is non-trivial: representing visual evidence by vocabulary embeddings or weakly supervised hidden states can distort fine-grained cues such as texture, color, and spatial layout. Effective visual latent reasoning requires a visual manifold that can preserve rich perceptual evidence while remaining compatible with language-grounded reasoning.

Latent Visual Reasoning. To move beyond static perception toward internal visual imagination, recent paradigms have explored performing logical deductions directly within the latent space. LVR \[[9](https://arxiv.org/html/2605.07106#bib.bib18)\] performs autoregressive reasoning within the visual embedding space by reconstructing task-critical tokens from latent states. Monet generates continuous embeddings that serve as intermediate visual thoughts and aligns them with the visual semantic space through a distillation pipeline \[[19](https://arxiv.org/html/2605.07106#bib.bib19)\]. Mirage further treats hidden states as latent visual tokens to build multimodal reasoning trajectories without pixel-level image synthesis \[[25](https://arxiv.org/html/2605.07106#bib.bib22)\]. Despite these advances, recent diagnostic studies reveal a persistent causality gap: latent tokens are often weakly grounded in visual inputs and exert limited influence on final answers \[[11](https://arxiv.org/html/2605.07106#bib.bib21)\]. Our analyses further reveal a fundamental manifold divergence in existing baselines, where latent trajectories drift into deep, uncalibrated regions far from the pre-trained semantic anchors. These limitations thus motivate our grounded latent reasoning framework.

## 3 Analysis on Reasoning Manifold

To understand how latent tokens shape the reasoning trajectory of models trained for latent reasoning, we develop a geometric analysis that visualizes the path traversed by hidden states during a single inference, relative to both the original base-model manifold and the vocabulary embedding space. The analysis is motivated by a simple question: as the model generates a sequence of latent tokens followed by the decoded language answer tokens, how does its internal representation travel through the joint space of hidden states and vocabulary embeddings?

We construct a dataset of reasoning trajectories from an evaluation set of $N$ samples. For each sample $i$, a forward-decoding pass produces last-layer hidden states $\{\mathbf{h}^{(i)}_{t}\}_{t=1}^{T_{i}}$, where each state is labeled as belonging to either the *latent* or *answer* phase. We denote the aggregate of these reasoning states across all samples as $\mathcal{H}_{\textit{RIS}}$. As references, we collect the corresponding hidden states from a frozen base model, denoted as $\mathcal{H}_{\mathrm{base}}$, alongside the vocabulary embedding matrix $\mathbf{E}\in\mathbb{R}^{V\times d}$. To visualize manifold distributions and reasoning trajectories in a shared space, we fit PCA jointly on $\mathcal{H}_{\mathrm{base}}$, $\mathcal{H}_{\textit{RIS}}$, and $\mathbf{E}$, and project each trajectory onto the plane spanned by the leading two principal components.
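The joint projection described above can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: it assumes the hidden states and the embedding matrix are already available as NumPy arrays, uses random stand-in data, and the function names `fit_joint_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_joint_pca(h_base, h_ris, vocab_emb, n_components=2):
    """Fit PCA on the union of base states, RIS states, and vocabulary embeddings."""
    stacked = np.concatenate([h_base, h_ris, vocab_emb], axis=0)
    mean = stacked.mean(axis=0, keepdims=True)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(points, mean, components):
    """Project d-dimensional points onto the shared principal plane."""
    return (points - mean) @ components.T

# Random stand-ins for H_base, H_RIS, and E (assumed shapes, not real data).
rng = np.random.default_rng(0)
d = 16
h_base = rng.normal(size=(200, d))
h_ris = rng.normal(loc=0.5, size=(120, d))
vocab_emb = rng.normal(scale=0.3, size=(300, d))

mean, components = fit_joint_pca(h_base, h_ris, vocab_emb)
trajectory = h_ris[:10]  # stand-in for one sample's T_i hidden states
path_2d = project(trajectory, mean, components)
```

Fitting a single basis on all three sources is what makes the projected trajectories directly comparable with the base-model and vocabulary point clouds.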

![Refer to caption](https://arxiv.org/html/2605.07106v1/x1.png)

Figure 1: Geometric analysis of latent reasoning paradigms: (a) manifold distribution and trajectories, (b) layer-wise parameter shift relative to the base model, and (c) attention pattern of answer tokens.

### 3.1 Manifold Compatibility and Trajectory Dynamics

We use *manifold* to refer to the empirical support of high-dimensional hidden-state or embedding representations. Figure [1](https://arxiv.org/html/2605.07106#S3.F1) compares the hidden-state distribution of the frozen base model, which serves as an empirical reference for the pretrained vocabulary-aligned manifold, with those induced by different latent reasoning training methods. In LVR and Monet, the learned latent states are visibly separated from this reference manifold, suggesting that they form distinct latent visual manifolds with richer visual expressiveness but also introduce representational distribution shifts. Such separation can weaken compatibility with the pretrained reasoning circuits acquired during large-scale language pretraining and with the language decoding process, which partly explains their degraded performance.

The trajectory visualization provides a dynamic view of this phenomenon. Successful reasoning paths tend to remain connected to the vocabulary-aligned manifold, whereas failed paths are more often trapped within detached latent visual regions. This does not imply that correct reasoning must explicitly return to the vocabulary manifold at specific steps; rather, effective latent visual reasoning should remain compatible with the pretrained representation regime, allowing the model to exploit existing reasoning circuits while incorporating fine-grained visual evidence in latent space. This supports our view: latent visual reasoning should not replace the model’s original reasoning manifold, but should develop as a compatible extension of it.

![Refer to caption](https://arxiv.org/html/2605.07106v1/x2.png)

Figure 2: Dataset construction pipeline. An MLLM decomposes each QA pair into several grounded reasoning steps, which are then verified and calibrated by Grounding DINO.

### 3.2 Layer-wise Adaptation Pattern

To further analyze the observed manifold compatibility, Figure [1](https://arxiv.org/html/2605.07106#S3.F1)(b) measures the layer-wise parameter shift from the base model. LVR and Monet show limited changes in the middle layers but large shifts in the output layers, indicating that their adaptation is concentrated near the final decoding interface rather than distributed across the internal computation stack. This pattern suggests that they form limited internal circuits for latent visual reasoning and instead rely on late-stage compensation to map detached latent visual states back to language-grounded outputs. In contrast, our method produces smoother and more sustained shifts across the middle layers, followed by much milder changes near the output layers. This indicates that the model gradually adapts its internal computation to support latent visual reasoning while preserving the vocabulary-aligned manifold near the decoding interface.
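One way to quantify such a layer-wise shift is sketched below. The text does not state the exact metric, so the relative L2 norm used here is an assumption, and `relative_shift` is a hypothetical helper operating on flattened per-layer parameter lists.

```python
import math

def relative_shift(base_params, tuned_params):
    """Relative L2 change per layer: ||tuned - base|| / ||base|| (assumed metric)."""
    shifts = {}
    for layer, base in base_params.items():
        tuned = tuned_params[layer]
        diff = math.sqrt(sum((t - b) ** 2 for t, b in zip(tuned, base)))
        norm = math.sqrt(sum(b ** 2 for b in base))
        shifts[layer] = diff / norm
    return shifts

# Toy flattened parameters for two layers (illustrative values only).
base = {0: [1.0, 0.0], 1: [0.0, 2.0]}
tuned = {0: [1.0, 0.0], 1: [0.0, 3.0]}
shifts = relative_shift(base, tuned)
```

Plotting `shifts` against the layer index would give a curve of the kind shown in panel (b).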

### 3.3 Latent Bypassing and Trajectory Collapse

The attention analysis further explains why a detached latent visual manifold does not necessarily lead to effective latent reasoning. As shown in Figure [1](https://arxiv.org/html/2605.07106#S3.F1)(c), answer tokens allocate most of their attention to the original image and text tokens, while assigning only minimal attention to latent tokens. This indicates that the final decoding largely bypasses the latent reasoning tokens instead of using them as intermediate computational states. This observation is consistent with the finding of \[[11](https://arxiv.org/html/2605.07106#bib.bib21)\] that latent tokens are weakly grounded in the visual premises and exert limited causal influence on the final answer. The trajectories of LVR and Monet in Figure [1](https://arxiv.org/html/2605.07106#S3.F1)(a) provide a complementary view. Across samples, their latent trajectories are highly similar and densely collapsed, suggesting limited instance-specific reasoning information. Thus, although these methods appear to form latent visual manifolds, such manifolds are not effectively integrated into the model’s computation: they neither receive sufficient input-grounded variation nor provide a reliable pathway back to the vocabulary-aligned manifold.
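The bypassing diagnostic can be illustrated with a small sketch that sums one answer token's attention weights per token group. The group boundaries and the toy attention row are assumptions made for illustration; `attention_share` is a hypothetical helper, not part of the paper.

```python
def attention_share(attn_row, groups):
    """Fraction of one answer token's attention mass assigned to each token group."""
    total = sum(attn_row)
    shares = {}
    for name, (start, end) in groups.items():
        shares[name] = sum(attn_row[start:end]) / total
    return shares

# Toy attention row over 4 image, 2 question, and 2 latent positions.
attn_row = [0.2, 0.2, 0.1, 0.1, 0.15, 0.15, 0.05, 0.05]
groups = {"image": (0, 4), "question": (4, 6), "latent": (6, 8)}
shares = attention_share(attn_row, groups)
```

A small `shares["latent"]` relative to the image and question shares corresponds to the bypassing behavior described above.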

Taken together, these analyses suggest: effective latent reasoning requires more than learning a separated latent visual manifold. It must establish compatible internal circuits that ground latent states in visual inputs, preserve access to pretrained reasoning circuits, and smoothly transfer visual abstractions back to the vocabulary-aligned manifold for language-grounded answer generation.

![Refer to caption](https://arxiv.org/html/2605.07106v1/x3.png)

Figure 3: Overview of the RIS framework. The visual and semantic decoders are used only for supervision during training and are removed at inference. The full attention mask is used during inference.

## 4 Method

In this section, we first elaborate on the construction of our step-wise Grounded Latent Supervision Dataset (GLSD), which provides spatial-semantic supervision for intermediate visual reasoning. Built on this, we introduce RIS, a grounded latent reasoning framework that uses last-layer hidden states as continuous latent tokens. Following common practice, RIS allocates a fixed number of latent tokens and feeds them forward as continuous embeddings. Figure [3](https://arxiv.org/html/2605.07106#S3.F3) illustrates the overall training pipeline.

### 4.1 Dataset Construction

Since existing datasets rarely provide fine-grained supervision for intermediate reasoning, we convert standard GQA \[[7](https://arxiv.org/html/2605.07106#bib.bib33)\] data into a unified step-wise grounded format and curate such supervision through a two-stage pipeline. First, we prompt an MLLM to decompose each question-answer pair into a 2–4 step trace, where each step contains an operation tag, a region-specific semantic description, visual information, and preliminary bounding-box coordinates, followed by a full answer (used as transition tokens) and a final answer, as visualized in Appendix [E](https://arxiv.org/html/2605.07106#A5). Then, we feed the generated descriptions and original image into Grounding DINO \[[13](https://arxiv.org/html/2605.07106#bib.bib23)\] to verify and calibrate the MLLM-proposed coordinates.

The verified trace is serialized as $\mathcal{T}$ and padded to the fixed latent budget $K$. Each supervised step is represented as `<step_start> <bbox>` $b_i$ `</bbox>` $v_i$ `<step_end>`, where $b_i$ denotes the normalized bounding box coordinates and $v_i$ denotes the corresponding visual information. Let $m_i\in\{0,1\}$ indicate whether the $i$-th slot has explicit step-wise supervision. For unsupervised slots with $m_i=0$, we insert `<step_start> [synthesize] <step_end>`, allowing them to learn evidence integration from downstream answer supervision. This yields 96k training samples.
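The serialization and padding step can be sketched as below. The special tokens follow the format given above; the coordinate formatting (comma-separated, two decimals), the space separator between blocks, and the helper name `serialize_trace` are assumptions for illustration.

```python
def serialize_trace(steps, budget):
    """Serialize grounded steps and pad to `budget` slots.

    `steps` is a list of (bbox, visual_info) pairs with normalized boxes.
    Returns the serialized trace and the supervision mask m_i.
    """
    if len(steps) > budget:
        raise ValueError("more steps than latent slots")
    blocks, mask = [], []
    for bbox, info in steps:
        # Coordinate formatting is an assumption, not specified in the text.
        coords = ",".join(f"{c:.2f}" for c in bbox)
        blocks.append(f"<step_start> <bbox>{coords}</bbox>{info}<step_end>")
        mask.append(1)
    for _ in range(budget - len(steps)):
        # Unsupervised slot: learns integration from the answer objective only.
        blocks.append("<step_start> [synthesize] <step_end>")
        mask.append(0)
    return " ".join(blocks), mask

text, mask = serialize_trace(
    [((0.10, 0.20, 0.50, 0.60), "a red cup on the table")], budget=3
)
```

Here one annotated step is padded with two `[synthesize]` slots, so the mask marks only the first slot as supervised.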

### 4.2 Training Phase 1: Explicit Grounded Reasoning Initialization

In Phase 1, we initialize the MLLM with explicit grounded visual reasoning before moving to continuous latent computation. Given image embeddings $\mathcal{X}_{v}$ and a tokenized query $\mathcal{X}_{q}$, the model is trained to generate the structured reasoning trace $\mathcal{T}$ followed by the full answer $A_{s}$ and final answer $A$. Unlike generic textual rationales, $\mathcal{T}$ contains step-wise normalized bounding-box coordinates, fine-grained visual information, and boundary tokens, thereby teaching the model to retrieve regions, integrate visual evidence, and synthesize the answer in a discrete but explicitly grounded format.

We optimize the backbone $\mathcal{M}$ with the standard next-token prediction (NTP) objective:

$$\mathcal{L}_{\mathrm{NTP}}=-\sum_{t=1}^{|\mathcal{T}\oplus A_{s}\oplus A|}\log P_{\mathcal{M}}\left(y_{t}\mid y_{<t},\mathcal{X}_{v},\mathcal{X}_{q}\right). \tag{1}$$

This stage equips the backbone with a reliable text-space grounded reasoning prior. After training, we use the Phase-1 trained model to encode the region-specific descriptions in each step. The resulting cached embeddings serve as stable semantic anchors for Phase 2 training, preventing the latent-space supervision targets from drifting during continuous-token training.

### 4.3 Training Phase 2a: Side-Head Grounding and Semantic Alignment

To transition from discrete textual reasoning to the continuous latent space, we introduce a set of fixed-capacity visual latent tokens $\mathcal{C}=\{c_{1},c_{2},\dots,c_{K}\}$. In this phase, we freeze the MLLM backbone and calibrate two lightweight, task-specific side heads: the Visual Grounding Decoder ($f_{\text{reg}}$) and the Semantic Alignment Decoder ($f_{\text{desc}}$). For each reasoning step $t$ where $m_{t}=1$, we extract the last-layer hidden state $z_{t}$ at the `<step_start>` token in the explicit reasoning trace, which summarizes the context before generating the step content. The two decoders then operate as explicit supervision signals:

Spatial Grounding: $f_{\text{reg}}$ predicts the bounding box coordinates $\hat{b}_{t}=f_{\text{reg}}(z_{t})$. We optimize this using a combination of $\ell_{1}$ distance and Generalized IoU (GIoU) loss against the verified boxes $b_{t}$:

$$\mathcal{L}_{\text{reg}}=\frac{1}{\sum_{t=1}^{K}m_{t}}\sum_{t=1}^{K}m_{t}\Big(\|\hat{b}_{t}-b_{t}\|_{1}+\lambda_{\text{GIoU}}\mathcal{L}_{\text{GIoU}}(\hat{b}_{t},b_{t})\Big) \tag{2}$$

Semantic Anchoring: To prevent the latent representation from drifting into uninterpretable regions, $f_{\text{desc}}$ projects $z_{t}$ into a semantic embedding space. We minimize the cosine distance between the projected vector and the pre-computed text embedding $e_{t}$ of the corresponding region description:

$$\mathcal{L}_{\text{desc}}=\frac{1}{\sum_{t=1}^{K}m_{t}}\sum_{t=1}^{K}m_{t}\Big(1-\cos(f_{\text{desc}}(z_{t}),e_{t})\Big) \tag{3}$$

The total loss for this warm-up phase is $\mathcal{L}_{\text{Phase2a}}=\lambda_{r}\mathcal{L}_{\text{reg}}+\lambda_{d}\mathcal{L}_{\text{desc}}$. This supervised calibration imposes spatial grounding to specify *where to look* and semantic anchoring to specify *what is seen*. These calibrated side heads provide stable spatial-semantic constraints for converting explicit reasoning steps into grounded latent tokens in the subsequent phase.
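As a worked illustration of Eqs. (2) and (3), the sketch below evaluates the masked Phase 2a loss on toy values. It assumes boxes are given as normalized `(x1, y1, x2, y2)` corners, takes $\mathcal{L}_{\text{GIoU}} = 1 - \text{GIoU}$, and uses placeholder weights of 1.0; none of these choices are specified in this section, and `phase2a_loss` is a hypothetical helper.

```python
import math

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def phase2a_loss(pred_boxes, boxes, projected, anchors, mask,
                 lam_giou=1.0, lam_r=1.0, lam_d=1.0):
    """lam_r * L_reg + lam_d * L_desc, averaged over supervised slots (m_t = 1)."""
    n = sum(mask)  # assumed > 0: every sample has at least one supervised step
    reg = desc = 0.0
    for pb, gb, z, e, m in zip(pred_boxes, boxes, projected, anchors, mask):
        if not m:
            continue  # unsupervised slots contribute nothing
        l1 = sum(abs(p - q) for p, q in zip(pb, gb))
        reg += l1 + lam_giou * (1.0 - giou(pb, gb))
        desc += 1.0 - cosine(z, e)
    return lam_r * reg / n + lam_d * desc / n

loss = phase2a_loss(
    pred_boxes=[(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.0, 0.0)],
    boxes=[(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0)],
    projected=[[1.0, 0.0], [0.0, 0.0]],
    anchors=[[0.0, 1.0], [0.0, 0.0]],
    mask=[1, 0],
)
```

For the single supervised slot, the L1 term is 1.0, the GIoU term is 0.75, and the cosine distance between orthogonal vectors is 1.0, so the loss is 2.75; the second slot is ignored because its mask entry is 0.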

### 4.4 Training Phase 2b: Progressive Latent Internalization

With the side heads calibrated, Phase 2b progressively converts explicit textual reasoning into continuous latent computation. Following the Coconut curriculum paradigm \[[5](https://arxiv.org/html/2605.07106#bib.bib27)\], we use a step-wise replacement schedule $s\in\{1,\dots,K\}$: at stage $s$, the first $s$ textual reasoning blocks are replaced by their corresponding latent states $\mathcal{C}_{\leq s}$, while the remaining textual steps $\mathcal{T}_{>s}$, the full answer $A_{s}$, and the final answer $A$ are still generated autoregressively as discrete text. Unlike Phase 2a, the MLLM backbone is unfrozen, allowing its internal computation to adapt to latent inputs.

The model is trained with next\-token prediction on the remaining textual trace and final answer:

$$\mathcal{L}_{\text{cot}}=-\sum_{t=1}^{|\mathcal{T}_{>s}|}\log P_{\mathcal{M}}(y_{t}\mid y_{<t},\mathcal{X}_{v},\mathcal{X}_{q},\mathcal{C}_{\leq s}),\quad \mathcal{L}_{\text{ans}}=-\sum_{t=1}^{|A_{s}\oplus A|}\log P_{\mathcal{M}}(a_{t}\mid a_{<t},\mathcal{X}_{v},\mathcal{X}_{q},\mathcal{C}_{\leq s},\mathcal{T}_{>s}) \tag{4}$$

To preserve grounding during internalization, the side-head losses are applied only to supervised latent slots that have been replaced, i.e., $\{t\leq s\mid m_{t}=1\}$. The overall objective is

$$\mathcal{L}_{\text{Phase2b}}=\mathcal{L}_{\text{ans}}+\alpha\mathcal{L}_{\text{cot}}+\lambda_{r}\mathcal{L}_{\text{reg}}+\lambda_{d}\mathcal{L}_{\text{desc}} \tag{5}$$

As $s$ increases, information previously expressed by text is gradually internalized into continuous latent states. By the end of this curriculum, the model learns to retrieve and transform grounded visual evidence in latent space with reduced dependence on explicit verbalization.
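The replacement schedule can be sketched as follows: at stage $s$ the first $s$ textual blocks become latent slots, and the side-head losses apply only to replaced slots with $m_t=1$. The `<latent>` placeholder string and both helper names are hypothetical; in the actual model the replaced positions carry continuous hidden states rather than a text token.

```python
def build_stage_sequence(text_steps, s, latent_token="<latent>"):
    """Replace the first s textual reasoning blocks with latent placeholders."""
    if not 0 <= s <= len(text_steps):
        raise ValueError("stage must lie between 0 and the latent budget K")
    return [latent_token] * s + list(text_steps[s:])

def side_head_slots(mask, s):
    """1-based indices t <= s with m_t = 1, i.e. slots that get side-head losses."""
    return [t for t in range(1, s + 1) if mask[t - 1] == 1]

steps = [
    "<step_start> <bbox>0.10,0.20,0.50,0.60</bbox>a red cup<step_end>",
    "<step_start> <bbox>0.40,0.10,0.90,0.70</bbox>a wooden table<step_end>",
    "<step_start> [synthesize] <step_end>",
]
mask = [1, 1, 0]

stage_two = build_stage_sequence(steps, 2)
supervised = side_head_slots(mask, 3)
```

With $K=3$ and two annotated steps, stage 2 leaves only the `[synthesize]` block as text, and even at stage 3 the side-head losses still cover slots 1 and 2 only.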

### 4.5 Training Phase 3: Bottlenecked Latent Integration

In Phase 3, all explicit reasoning steps are replaced by latent tokens, so the intermediate reasoning process is fully mediated by the latent tokens $\mathcal{C}$. To prevent answer decoding from bypassing these tokens and directly attending to the original image and query tokens $(\mathcal{X}_{v},\mathcal{X}_{q})$, we introduce a Progressive Attention Mask. Specifically, we anneal a masking probability $\rho(\tau)$ over training step $\tau$ and sample $M\sim\mathrm{Bernoulli}(\rho(\tau))$ to modulate the causal attention matrix. As $\rho(\tau)$ increases, answer tokens are gradually forced to rely on $\mathcal{C}$, making latent states the information conduit for task-relevant visual and semantic evidence.
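A minimal sketch of this bottleneck is given below. The linear ramp for $\rho(\tau)$, the per-example sampling of $M$, and the token layout (context, then latent tokens, then answer tokens) are assumptions, since the schedule shape and sampling granularity are not given here; both function names are hypothetical.

```python
import random

def mask_probability(step, total_steps, rho_max=1.0):
    """Linear ramp for rho(tau); the schedule shape is an assumption."""
    return rho_max * min(1.0, step / total_steps)

def build_attention_mask(n_ctx, n_latent, n_answer, bottleneck):
    """Boolean causal mask; allowed[q][k] is True if query q may attend to key k.

    Assumed layout: [image + query context | latent tokens | answer tokens].
    """
    n = n_ctx + n_latent + n_answer
    allowed = [[k <= q for k in range(n)] for q in range(n)]
    if bottleneck:
        # Block answer -> context attention so evidence must pass through latents.
        for q in range(n_ctx + n_latent, n):
            for k in range(n_ctx):
                allowed[q][k] = False
    return allowed

rng = random.Random(0)
rho = mask_probability(step=500, total_steps=1000)
use_bottleneck = rng.random() < rho  # M ~ Bernoulli(rho)
attn_mask = build_attention_mask(n_ctx=3, n_latent=2, n_answer=2, bottleneck=True)
```

Under the bottleneck, answer positions can still see the latent tokens and earlier answer tokens, while latent tokens keep full causal access to the image and query context.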

Since no textual reasoning steps remain in this phase, the textual CoT loss is removed\. The model is trained with the final\-answer objective, together with side\-head constraints on supervised latent slots:

$$\mathcal{L}_{\text{Phase3}}=\mathcal{L}_{\text{ans}}+\lambda_{r}\mathcal{L}_{\text{reg}}+\lambda_{d}\mathcal{L}_{\text{desc}} \tag{6}$$

Under the fully masked condition, the answer loss is conditioned on the latent tokens:

$$\mathcal{L}_{\mathrm{ans}}=-\sum_{t=1}^{|A_{s}\oplus A|}\log P_{\mathcal{M}}\left(u_{t}\mid u_{<t},\mathcal{C}\right), \tag{7}$$

where $A_{s}$ and $A$ denote the full-answer bridge and the final answer, and $u_{t}$ indexes the tokens of their concatenation.

Under this information bottleneck, the supervised latent tokens remain spatially and semantically grounded through the side\-head losses, while the free tokens, which receive no direct spatial\-semantic supervision, are optimized only through the answer objective and are therefore encouraged to aggregate and synthesize the evidence retrieved by earlier grounded tokens\.

Finally, to facilitate a smooth return from the latent visual manifold to the discrete vocabulary manifold $\mathcal{M}_{vocab}$, we utilize the short full answer $A_{s}$ (as defined in Phase 1) as a sequence of Manifold Transition Tokens. Instead of forcing an abrupt jump from synthesized latent states to final answer decoding, $A_{s}$ provides an autoregressive intermediate path that gradually maps latent visual representations back toward the pretrained vocabulary-aligned manifold $\mathcal{M}_{vocab}$. Its dense next-token supervision stabilizes this transition and improves the compatibility between latent visual reasoning and language-grounded answer generation.

## 5 Experiments

### 5.1 Experiment Setup

Training and Evaluation Setup. We adopt Qwen2.5-VL-7B \[[1](https://arxiv.org/html/2605.07106#bib.bib28)\] as our base model. The training process follows our proposed three-phase pipeline and all parameters are detailed in Appendix [B](https://arxiv.org/html/2605.07106#A2).

Evaluated Benchmarks. To comprehensively evaluate our proposed method, we conduct experiments on a diverse set of challenging perception and reasoning benchmarks: V\* \[[24](https://arxiv.org/html/2605.07106#bib.bib31)\], HRBench4K \[[20](https://arxiv.org/html/2605.07106#bib.bib32)\], HRBench8K \[[20](https://arxiv.org/html/2605.07106#bib.bib32)\], MMVP \[[18](https://arxiv.org/html/2605.07106#bib.bib29)\], and BLINK \[[3](https://arxiv.org/html/2605.07106#bib.bib30)\].

Baselines\. We compare RIS against a variety of baselines: \(1\) Proprietary Models: GPT\-4o; \(2\) Open\-Source Base Model: Qwen2\.5\-VL\-7B; \(3\) Vanilla SFT: Qwen2\.5\-VL\-7B\+GLSD, which finetunes the base model on our curated Grounded Latent Supervision Dataset \(GLSD\) for the same number of training steps as RIS; \(4\) Latent Visual Reasoning Methods: LVR \[[9](https://arxiv.org/html/2605.07106#bib.bib18)\], Monet \[[19](https://arxiv.org/html/2605.07106#bib.bib19)\], and CoVT \[[14](https://arxiv.org/html/2605.07106#bib.bib20)\]\. Furthermore, to investigate the benefit of reinforcement learning on our method, we introduce RIS\+VLPO, a variant that further optimizes our model using Visual\-latent Policy Optimization \(VLPO\), proposed by Monet\.

### 5\.2Main Results

Table [2](https://arxiv.org/html/2605.07106#S5.T2) summarizes the performance of RIS and the baselines across five visual reasoning benchmarks\. Overall, RIS consistently outperforms both the open\-source backbone and existing latent visual reasoning baselines\. The GLSD baseline, which retains explicit textual reasoning, yields much smaller gains\. This indicates that the improvements are not only due to extra supervision but mainly stem from internalizing grounded visual evidence into latent states and performing latent reasoning\. RIS\+VLPO does not consistently yield further gains: it is slightly ahead of RIS on MMVP and BLINK but behind it on $\text{V}^{*}$ and on the HRBench4K and HRBench8K overall scores\. Reinforcement learning nevertheless remains a promising direction for adaptive and stable latent computation, and a reliable variant for latent reasoning is still underexplored\.

| Model | S.R. | O.L. | R.R. | Counting | R.D. |
| --- | --- | --- | --- | --- | --- |
| Base | 86.81 | 43.62 | 39.87 | 68.30 | 67.26 |
| Base+GLSD | 87.03 | 49.57 | 41.13 | 67.59 | 65.66 |
| LVR | 86.01 | 50.82 | 43.28 | 69.17 | 76.61 |
| Monet | 85.31 | 45.08 | 39.55 | 70.83 | 75.81 |
| CoVT | 87.41 | 53.20 | 38.06 | 65.83 | 77.65 |
| RIS | 89.60 | 54.63 | 44.25 | 72.02 | 76.50 |
| Improvement | +2.79 | +11.01 | +4.38 | +3.72 | +9.24 |

Table 1: Performance on BLINK \(S\.R\.: Spatial Reasoning, O\.L\.: Object Localization, R\.R\.: Relative Reflectance, R\.D\.: Relative Depth\)\.

| Model | V\* Overall | V\* Attribute | V\* Spatial | HRBench4K Overall | HRBench4K FSP | HRBench4K FCP | HRBench8K Overall | HRBench8K FSP | HRBench8K FCP | MMVP | BLINK |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Proprietary Model* | | | | | | | | | | | |
| GPT-4o | 65.15 | 69.68 | 58.39 | 54.70 | 64.93 | 44.51 | 51.12 | 57.08 | 45.12 | 72.00 | 63.55 |
| *Open-Source Model* | | | | | | | | | | | |
| Qwen2.5-VL-7B | 76.65 | 77.12 | 74.35 | 68.30 | 80.60 | 56.03 | 64.33 | 74.42 | 54.24 | 63.49 | 54.94 |
| Qwen2.5-VL-7B+GLSD | 78.25 | 78.39 | 78.11 | 70.58 | 83.30 | 57.86 | 64.70 | 74.69 | 54.70 | 66.80 | 56.25 |
| LVR \[[9](https://arxiv.org/html/2605.07106#bib.bib18)\] | 80.60 | 83.26 | 77.94 | 70.88 | 83.25 | 57.50 | 63.50 | 75.00 | 52.00 | 71.03 | 56.79 |
| Monet \[[19](https://arxiv.org/html/2605.07106#bib.bib19)\] | 81.90 | 81.94 | 81.86 | 69.97 | 84.01 | 55.93 | 67.01 | 78.59 | 55.43 | 70.00 | 56.70 |
| CoVT \[[14](https://arxiv.org/html/2605.07106#bib.bib20)\] | 79.10 | 81.05 | 77.15 | 71.90 | 85.50 | 58.30 | 68.40 | 79.30 | 57.50 | 58.70 | 57.40 |
| *Our Model* | | | | | | | | | | | |
| RIS | 83.75 | 84.26 | 83.24 | 73.23 | 86.33 | 60.12 | 68.52 | 79.05 | 57.98 | 73.55 | 60.60 |
| RIS+VLPO \[[19](https://arxiv.org/html/2605.07106#bib.bib19)\] | 81.76 | 81.24 | 82.28 | 71.79 | 82.83 | 60.75 | 63.05 | 72.67 | 53.42 | 73.76 | 60.95 |
| Relative Improvement | +7.10 | +7.14 | +8.89 | +4.93 | +5.73 | +4.09 | +4.19 | +4.63 | +3.74 | +10.27 | +5.66 |

Table 2: Main results on visual reasoning benchmarks across proprietary, open\-source, and latent visual reasoning baselines \(with Qwen2\.5\-VL\-7B as the shared backbone\), using 5 latent tokens\.

The performance pattern further aligns with the characteristics of each benchmark\. RIS achieves larger gains on benchmarks requiring precise localization, structured visual search, and multi\-step perceptual reasoning, such as $\text{V}^{*}$ and BLINK\. The detailed BLINK results in Table [1](https://arxiv.org/html/2605.07106#S5.T1) show particularly strong improvements on tasks closely tied to visual grounding and sequential evidence retrieval, including Spatial Reasoning, Object Localization, and Relational Reasoning\. In contrast, on MMVP, where performance is more constrained by the visual encoder’s ability, the gains are more moderate\.

### 5\.3Analysis on Latent Behaviors

#### Impact of Latent Token Budget\.

Figure[5](https://arxiv.org/html/2605.07106#S5.F5)\(c\) studies the effect of latent token budget\. Although more latent tokens provide additional computation, performance does not improve monotonically\. This is because most training samples contain only three to four supervised reasoning steps, which already match the typical length of grounded logical reasoning\. Tokens beyond this range receive no spatial\-semantic supervision and only learn evidence integration from the final\-answer objective\.

The effect is therefore task\-dependent\. For benchmarks requiring multi\-step visual exploration, such as $\text{V}^{*}$ and BLINK, a small number of extra tokens can help integrate retrieved evidence\. However, larger budgets introduce more unsupervised slots, weakening grounding reliability and increasing optimization instability\. This degradation is more evident on HRBench8K and MMVP: the former requires highly reliable local evidence under extreme resolution, while the latter is mainly limited by the visual encoder’s ability\. Overall, the latent budget should match the number of steps covered by the available supervision, enabling RIS to expand latent reasoning without diluting grounded visual evidence\.

#### Entropy Dynamics of Latent Reasoning\.

Figure [4](https://arxiv.org/html/2605.07106#S5.F4)\(c\) visualizes the normalized reasoning entropy along the latent trajectory, with details provided in Appendix [D](https://arxiv.org/html/2605.07106#A4)\. The supervised latent tokens maintain high entropy, suggesting they preserve an open visual reasoning space rather than being constrained to a single reasoning path\. The unsupervised slots show a slight entropy increase, indicating their role in aggregating and synthesizing evidence retrieved by earlier grounded tokens\. After the latent tokens, the transition tokens bridge high\-entropy latent states to low\-entropy answer tokens, avoiding an abrupt representation jump\. This supports our design: visual evidence is first explored and integrated in the latent space, then transitioned back to the vocabulary decoding space\.

![Refer to caption](https://arxiv.org/html/2605.07106v1/x4.png)Figure 4: Latent Behavior Analysis of RIS: Diversity, Reasoning Entropy, and Interpretability\.

#### Latent Token Diversity and Interpretability\.

Figure [4](https://arxiv.org/html/2605.07106#S5.F4)\(a\)\(b\) examines whether latent tokens collapse into instance\-agnostic patterns\. In both cross\-sample and within\-sample analyses, RIS shows much lower similarity than LVR and Monet, indicating more step\-specific and instance\-dependent latent states\. Although within\-sample similarity is naturally higher due to shared visual content within the same image, RIS remains notably diverse, suggesting that its latent tokens progressively organize grounded visual\-semantic evidence rather than repeatedly encoding the same features\. This effect is especially clear on BLINK, where repeated visual search and structured perception are required\. Cross\-sample similarity is lower overall, and the residual similarity is likely due to commonalities across examples; nevertheless, RIS still maintains substantially stronger diversity than prior methods\.
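The two similarity views discussed above can be illustrated with a minimal sketch. The paper does not spell out the exact protocol in this section, so the choices here are assumptions: latent tokens are represented as plain vectors, within\-sample similarity averages cosine similarity over all token pairs of one sample, and cross\-sample similarity compares tokens at the same latent position across different samples\. The function names are hypothetical\.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def within_sample_similarity(samples):
    """Mean pairwise cosine similarity between latent tokens of the same sample."""
    sims = [cosine(a, b) for tokens in samples for a, b in combinations(tokens, 2)]
    return sum(sims) / len(sims)


def cross_sample_similarity(samples):
    """Mean cosine similarity between same-position tokens of different samples.

    Assumption: every sample uses the same number of latent slots.
    """
    sims = []
    for first, second in combinations(samples, 2):
        for a, b in zip(first, second):
            sims.append(cosine(a, b))
    return sum(sims) / len(sims)


# Toy data: each sample is a list of latent-token vectors (illustrative only).
collapsed = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
diverse = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]

print(within_sample_similarity(collapsed), cross_sample_similarity(collapsed))
print(within_sample_similarity(diverse), cross_sample_similarity(diverse))
```

Under this reading, a collapsed model scores close to 1 on both measures, while step\-specific and instance\-dependent latent states push both toward lower values\.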

Figure[4](https://arxiv.org/html/2605.07106#S5.F4)\(d\) further provides a qualitative view\. The decoded bounding boxes form a clear step\-by\-step reasoning trajectory, and the associated visual information captures key evidence in each region\. The final synthesis tokens cover multiple semantically relevant regions and decode integrated visual information rather than isolated local details\. This confirms their role as an evidence integration stage, consistent with the entropy increase observed in Figure[4](https://arxiv.org/html/2605.07106#S5.F4)\(c\)\.

### 5\.4Ablation Study

![Refer to caption](https://arxiv.org/html/2605.07106v1/x5.png)Figure 5: Design Ablations\. \(Error bars denote accuracy standard deviation across repeated training runs\.\)

To assess the contribution of each component in RIS, we conduct systematic ablations\. The results, summarized in Figure [5](https://arxiv.org/html/2605.07106#S5.F5), lead to three critical takeaways regarding latent visual reasoning\.

Spatial\-semantic supervision stabilizes grounded latent reasoning\. Removing either side\-head supervision degrades performance and increases variance, confirming the importance of explicit grounding\. Bounding\-box supervision is especially important on localization\-heavy benchmarks such as $\text{V}^{*}$ and BLINK, showing that spatial anchoring is crucial for directing latent tokens to task\-relevant visual evidence\. Description supervision has a milder but consistent impact, suggesting that semantic alignment mainly regularizes the latent space and prevents drift from the vocabulary reasoning manifold\.

The progressive attention mask prevents latent bypassing and collapse\. Removing the progressive attention mask causes the largest performance drop\. Without this bottleneck, answer tokens can directly attend to the raw image and question, weakening the role of latent tokens\. The degradation further suggests that uncalibrated latent tokens are not merely ignored placeholders; they can interfere with prediction even when answer tokens still access the original inputs\. This is further supported by Figure [5](https://arxiv.org/html/2605.07106#S5.F5)\(b\): removing the mask also sharply increases both cross\-sample and within\-sample latent\-token similarity, revealing collapse toward generic representations\. Thus, the mask is essential for routing visual evidence through the latent trajectory and maintaining step\-specific latent computation\.

Transition tokens facilitate language\-grounded decoding\. Removing transition tokens also consistently hurts performance, indicating that the short\-answer bridge is important for mapping synthesized latent evidence back to language\-grounded decoding\. This supports our hypothesis that transition tokens reduce representation mismatch between the expressive latent visual manifold and the vocabulary\-aligned output space\.

## 6Conclusion

In this work, we propose RIS, a spatial\-semantic grounded latent visual reasoning framework that enables MLLMs to reason over visual evidence within continuous latent states\. We show that existing latent visual reasoning methods suffer from weak grounding, trajectory collapse, and answer shortcuts\. To address these issues, RIS anchors latent tokens with spatial\-semantic supervision, enforces their use through a progressive attention bottleneck, and introduces transition tokens to bridge latent reasoning back to the language decoding phase\. Extensive experiments demonstrate consistent improvements over strong baselines\. Further analyses verify that RIS learns more diverse, interpretable, and causally effective latent trajectories, suggesting a practical path to faithful internal visual reasoning in MLLMs\.

## References

- \[1\] S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025\) Qwen2\.5\-vl technical report\. External Links: 2502\.13923, [Link](https://arxiv.org/abs/2502.13923)\.
- \[2\] \(2026\) LLM latent reasoning as chain of superposition\. External Links: 2510\.15522, [Link](https://arxiv.org/abs/2510.15522)\.
- \[3\] X\. Fu, Y\. Hu, B\. Li, Y\. Feng, H\. Wang, X\. Lin, D\. Roth, N\. A\. Smith, W\. Ma, and R\. Krishna \(2024\) Blink: multimodal large language models can see but not perceive\. In European Conference on Computer Vision, pp\. 148–166\.
- \[4\] T\. Gupta and A\. Kembhavi \(2023\) Visual programming: compositional visual reasoning without training\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 14953–14962\.
- \[5\] S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian \(2024\) Training large language models to reason in a continuous latent space\. arXiv preprint arXiv:2412\.06769\.
- \[6\] Y\. Hu, O\. Stretcu, C\. Lu, K\. Viswanathan, K\. Hata, E\. Luo, R\. Krishna, and A\. Fuxman \(2024\) Visual program distillation: distilling tools and programmatic reasoning into vision\-language models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 9590–9601\.
- \[7\] D\. A\. Hudson and C\. D\. Manning \(2019\) GQA: a new dataset for real\-world visual reasoning and compositional question answering\. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 6693–6702\. External Links: [Link](https://api.semanticscholar.org/CorpusID:152282269)\.
- \[8\] T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. Advances in neural information processing systems 35, pp\. 22199–22213\.
- \[9\] B\. Li, X\. Sun, J\. Liu, Z\. Wang, J\. Wu, X\. Yu, H\. Chen, E\. Barsoum, M\. Chen, and Z\. Liu \(2025\) Latent visual reasoning\. arXiv preprint arXiv:2509\.24251\.
- \[10\] J\. Li, D\. Li, C\. Xiong, and S\. Hoi \(2022\) Blip: bootstrapping language\-image pre\-training for unified vision\-language understanding and generation\. In International conference on machine learning, pp\. 12888–12900\.
- \[11\] Y\. Li, C\. Chen, Y\. Li, F\. Zeng, K\. Huang, J\. Xu, and M\. Sun \(2026\) Imagination helps visual reasoning, but not yet in latent space\. arXiv preprint arXiv:2602\.22766\.
- \[12\] S\. Liu, H\. Cheng, H\. Liu, H\. Zhang, F\. Li, T\. Ren, X\. Zou, J\. Yang, H\. Su, J\. Zhu, et al\. \(2024\) Llava\-plus: learning to use tools for creating multimodal agents\. In European conference on computer vision, pp\. 126–142\.
- \[13\] S\. Liu, Z\. Zeng, T\. Ren, F\. Li, H\. Zhang, J\. Yang, C\. Li, J\. Yang, H\. Su, J\. Zhu, et al\. \(2023\) Grounding dino: marrying dino with grounded pre\-training for open\-set object detection\. arXiv preprint arXiv:2303\.05499\.
- \[14\] Y\. Qin, B\. Wei, J\. Ge, K\. Kallidromitis, S\. Fu, T\. Darrell, and X\. Wang \(2025\) Chain\-of\-visual\-thought: teaching vlms to see and think better with continuous visual tokens\. arXiv preprint arXiv:2511\.19418\.
- \[15\] Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He \(2025\) Codi: compressing chain\-of\-thought into continuous space via self\-distillation\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 677–693\.
- \[16\] Z\. Su, P\. Xia, H\. Guo, Z\. Liu, Y\. Ma, X\. Qu, J\. Liu, Y\. Li, K\. Zeng, Z\. Yang, et al\. \(2025\) Thinking with images for multimodal reasoning: foundations, methods, and future frontiers\. arXiv preprint arXiv:2506\.23918\.
- \[17\] D\. Surís, S\. Menon, and C\. Vondrick \(2023\) Vipergpt: visual inference via python execution for reasoning\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 11888–11898\.
- \[18\] S\. Tong, Z\. Liu, Y\. Zhai, Y\. Ma, Y\. LeCun, and S\. Xie \(2024\) Eyes wide shut? exploring the visual shortcomings of multimodal llms\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 9568–9578\.
- \[19\] Q\. Wang, Y\. Shi, Y\. Wang, Y\. Zhang, P\. Wan, K\. Gai, X\. Ying, and Y\. Wang \(2025\) Monet: reasoning in latent visual space beyond images and language\. arXiv preprint arXiv:2511\.21395\.
- \[20\] W\. Wang, L\. Ding, M\. Zeng, X\. Zhou, L\. Shen, Y\. Luo, W\. Yu, and D\. Tao \(2025\) Divide, conquer and combine: a training\-free framework for high\-resolution image perception in multimodal large language models\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 7907–7915\.
- \[21\] Y\. Wang, S\. Wu, Y\. Zhang, S\. Yan, Z\. Liu, J\. Luo, and H\. Fei \(2025\) Multimodal chain\-of\-thought reasoning: a comprehensive survey\. arXiv preprint arXiv:2503\.12605\.
- \[22\] Z\. Wang, J\. Zhu, B\. Tang, Z\. Li, F\. Xiong, J\. Yu, and M\. B\. Blaschko \(2025\) Jigsaw\-r1: a study of rule\-based visual reinforcement learning with jigsaw puzzles\. arXiv preprint arXiv:2505\.23590\.
- \[23\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\.
- \[24\] P\. Wu and S\. Xie \(2024\) V\*: guided visual search as a core mechanism in multimodal llms\. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 13084–13094\. External Links: [Document](https://dx.doi.org/10.1109/CVPR52733.2024.01243)\.
- \[25\] Z\. Yang, X\. Yu, D\. Chen, M\. Shen, and C\. Gan \(2025\) Machine mental imagery: empower multimodal reasoning with latent visual tokens\. arXiv preprint arXiv:2506\.17218\.
- \[26\] Z\. Yang, L\. Li, J\. Wang, K\. Lin, E\. Azarnasab, F\. Ahmed, Z\. Liu, C\. Liu, M\. Zeng, and L\. Wang \(2023\) Mm\-react: prompting chatgpt for multimodal reasoning and action\. arXiv preprint arXiv:2303\.11381\.
- \[27\] X\. Yu, Z\. Chen, Y\. He, T\. Fu, C\. Yang, C\. Xu, Y\. Ma, X\. Hu, Z\. Cao, J\. Xu, et al\. \(2026\) The latent space: foundation, evolution, mechanism, ability, and outlook\. arXiv preprint arXiv:2604\.02029\.
- \[28\] Z\. Zhang, A\. Zhang, M\. Li, H\. Zhao, G\. Karypis, and A\. Smola \(2023\) Multimodal chain\-of\-thought reasoning in language models\. arXiv preprint arXiv:2302\.00923\.

## Appendix ALimitations

Although RIS provides an effective framework for grounded latent visual reasoning, it still has several limitations\.

First, our framework involves multiple interacting components\. While each component is empirically essential, the overall framework introduces several hyperparameters and schedules whose optimal settings may depend on the backbone model, task type, and data distribution\. A more principled and automatic strategy for configuring these components would further improve the robustness and usability of our framework\.

Second, RIS still fixes the latent token budget, a common setting in most existing latent reasoning methods\. In our framework, the supervised latent tokens and unsupervised latent tokens provide an implicit form of adaptive computation, and our analyses suggest that different latent slots can spontaneously specialize into retrieval, integration, and synthesis roles\. Nevertheless, the total number of latent slots is still predefined rather than dynamically determined according to the complexity of each input\. Developing latent reasoning models with truly adaptive token allocation remains an important direction for future work\.

Third, our experiments mainly focus on standard image\-based VQA and visual reasoning benchmarks commonly used for latent visual reasoning\. Although these benchmarks cover localization, structured visual search, and multi\-step perceptual reasoning, they do not fully reflect the diversity of real\-world multimodal reasoning scenarios\. Future work will further evaluate and extend RIS to broader settings, such as open\-ended long\-form visual question answering, multi\-image reasoning, video reasoning, and interactive embodied tasks\.

## Appendix BTraining Hyperparameters

Table 3: Detailed Hyperparameters for RIS Training\.

| Training Phase | Hyperparameter | Value |
| --- | --- | --- |
| Architecture | Base Model | Qwen2.5-VL-7B |
| Architecture | Latent Tokens ($K$) | 5 |
| Phase 1 | Epochs | 1 – 2 |
| Phase 1 | Learning Rate | $1\times 10^{-5}$ |
| Phase 2 | Sub-stage 2.A LR ($f_{reg}$) | $5\times 10^{-4}$ |
| Phase 2 | Sub-stage 2.A LR ($f_{desc}$) | $1\times 10^{-4}$ |
| Phase 2 | Sub-stage 2.B Epochs per Curriculum Stage | 1 – 2 |
| Phase 2 | Sub-stage 2.B LR | $5\times 10^{-6}$ |
| Phase 3 | Epochs | 1 – 2 |
| Phase 3 | Learning Rate | $3\times 10^{-6}$ |
| Phase 3 | Mask Ratio Annealing | $0.3\rightarrow 1.0$ |
| Loss Weights | $\lambda_{r}$ (Region Grounding) | 1.0 |
| Loss Weights | $\lambda_{giou}$ (GIoU within Region Grounding) | 2.0 |
| Loss Weights | $\lambda_{d}$ (Region Description Alignment) | 0.5 |
| Loss Weights | $\alpha$ (Textual CoT) | 1.0 |

Table 4: Additional hyperparameters for the RIS\+VLPO reinforcement learning stage\.

| Hyperparameter | Value |
| --- | --- |
| Initialization | Phase-3 RIS checkpoint |
| RL data | 3.2k/6.4k samples from GLSD |
| Latent action | RIS latent slot hidden state $c_t$ |
| Latent tokens | $K=5$ |
| Attention mask | Phase-3 bottleneck mask, fixed at $\rho=1.0$ |
| Reference model | Frozen Phase-3 RIS checkpoint |
| Rollouts per prompt | $G=4$ |
| Policy epochs per batch | 1 |
| RL epochs | 1 |
| Global prompt batch size | 32 |
| Learning rate | $1\times 10^{-6}$ |
| KL coefficient | $\beta=0.02$ |
| Latent likelihood scale | $\sigma=1.0$ |
| Max response length | 128 tokens |
| Sampling temperature / top-$p$ | 1.0 / 0.95 |
| Reward | $r_{\rm acc}+0.1\,r_{\rm fmt}$ |
| Accuracy reward $r_{\rm acc}$ | 1 if normalized answer matches ground truth, else 0 |
| Format reward $r_{\rm fmt}$ | 1 if final answer is extractable, else 0 |
| Latent-use reward | None |
| Grounding regularization | $0.1\,\mathcal{L}_{\rm reg}+0.05\,\mathcal{L}_{\rm desc}$ |
| Trainable modules | MLLM backbone; side heads frozen |
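As a small illustration of the reward row in Table 4, the sketch below combines a binary accuracy reward and a binary format reward as $r_{\rm acc}+0.1\,r_{\rm fmt}$\. Only the weighting and the two binary definitions come from the table\. The normalization rule and the `Answer:` extraction pattern are hypothetical stand\-ins, since the paper does not specify them here, and the function names are illustrative\.

```python
import re


def normalize(text):
    """Hypothetical normalization: trim, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".")


def extract_answer(response):
    """Hypothetical extraction rule: text after the last 'Answer:' marker, else None."""
    matches = re.findall(r"answer:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1] if matches else None


def rl_reward(response, ground_truth, fmt_weight=0.1):
    """Reward = r_acc + fmt_weight * r_fmt, following the Table 4 row."""
    answer = extract_answer(response)
    r_fmt = 1.0 if answer is not None else 0.0
    r_acc = 1.0 if answer is not None and normalize(answer) == normalize(ground_truth) else 0.0
    return r_acc + fmt_weight * r_fmt


print(rl_reward("The mug is on the left. Answer: B", "b"))
print(rl_reward("Answer: C", "b"))
print(rl_reward("I am not sure", "b"))
```

With these stand\-in rules, a correct and extractable answer earns both terms, an extractable but wrong answer earns only the format term, and an unparseable response earns nothing\.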
## Appendix CCompute Resources

All experiments were conducted on a single server equipped with an Intel Xeon Platinum 8383C CPU, 512GB DDR5 RAM, and 4 NVIDIA A100 80GB GPUs connected with NVLink\. We trained on an 80k\-sample Grounded Latent Supervision Dataset \(GLSD\) with a global batch size of 32, corresponding to 2,500 optimization steps for each full data pass\. Table [5](https://arxiv.org/html/2605.07106#A3.T5) reports the estimated wall\-clock training time for each stage of RIS\. For Phase 2b, we assume a latent budget of $K=5$ and one full data pass for each curriculum stage; its cost scales approximately linearly with $K$\.

Table 5: Recorded training time of different stages on 4 NVIDIA A100 80GB GPUs\.

| Training Stage | Trainable Components | Update Steps | Wall-clock Time |
| --- | --- | --- | --- |
| Phase 1 | Backbone | 2.5K | 2.5–3.0 h |
| Phase 2a | Side heads | 2.5K | 2.0–2.5 h |
| Phase 2b | Backbone + side heads | 12.5K | 16–20 h |
| Phase 3 | Backbone + side heads | 2.5K | 6.0–7.5 h |
| Total | – | 20.0K | 26.5–33 h |
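The step counts and totals in Table 5 follow from simple arithmetic, which the sketch below reproduces\. The dataset size, batch size, and per\-stage hours are taken from this appendix\. The assumption that Phase 2b runs exactly $K$ curriculum stages \(one data pass each\) is an inference from the 12\.5K step count, not something stated explicitly\.

```python
# Figures from Appendix C; variable names are illustrative.
dataset_size = 80_000
global_batch_size = 32
steps_per_pass = dataset_size // global_batch_size

latent_budget_k = 5  # K, the number of latent tokens

# Assumption: Phase 2b uses K curriculum stages with one full data pass each.
passes_per_phase = {
    "Phase 1": 1,
    "Phase 2a": 1,
    "Phase 2b": latent_budget_k,
    "Phase 3": 1,
}
steps_per_phase = {phase: n * steps_per_pass for phase, n in passes_per_phase.items()}
total_steps = sum(steps_per_phase.values())

# Reported wall-clock ranges in hours (low, high) for each stage.
hours = {
    "Phase 1": (2.5, 3.0),
    "Phase 2a": (2.0, 2.5),
    "Phase 2b": (16.0, 20.0),
    "Phase 3": (6.0, 7.5),
}
total_low = sum(low for low, _ in hours.values())
total_high = sum(high for _, high in hours.values())

print(steps_per_phase, total_steps, (total_low, total_high))
```

Changing `latent_budget_k` shows the approximately linear growth of the Phase 2b step count with $K$ under this assumption\.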
## Appendix DDetails of Reasoning Entropy Estimation

We provide a detailed definition and implementation of *Reasoning Entropy* in this section\. For latent tokens, we consider two possible estimators\. The first estimator, *Semantic Region Similarity as Reasoning Entropy*, offers a direct and strict way to quantify how much grounded visual evidence is encoded in each latent token by measuring its compatibility with multiple semantic evidence regions\. However, it depends on the semantic decoder $f_{\mathrm{desc}}$ calibrated for latent states and cannot be directly applied to language tokens due to their next\-token prediction nature\. Therefore, we use it only as a reference analysis for latent tokens\.

The second estimator, *Intervention\-based Visual Evidence Importance as Reasoning Entropy*, places latent tokens and language tokens in the same space\. It therefore enables a unified computation and comparison of *Reasoning Entropy* across latent tokens, transition tokens, and answer tokens\. In practice, we found that the entropy of latent tokens computed by the two estimators exhibits highly consistent trends\. Therefore, unless otherwise specified, we adopt the intervention\-based estimator as the unified definition of *Reasoning Entropy* throughout our analysis\.

### D\.1Dataset Preparation

For each image–question pair $(I,q)$, we construct an analysis\-only bank of diverse grounded reasoning traces\. Following the same data construction protocol used for our step\-wise grounded supervision, we prompt a strong MLLM to sample multiple visual reasoning traces\. Each trace contains a sequence of grounded evidence steps, and each step consists of a bounding box, a region\-specific semantic description, and the corresponding visual information\. We retain only traces whose final answers match the ground\-truth answer and whose grounded boxes pass the verification procedure\. This yields an image\-question\-specific trace bank:

$$\mathcal{T}(I,q)=\left\{\mathcal{T}_{m}=\{(b_{m,s},d_{m,s},v_{m,s})\}_{s=1}^{S_{m}}\right\}_{m=1}^{M},$$

where $m$ indexes a valid reasoning trace, $s$ indexes an evidence step, $b_{m,s}$ is the verified bounding box, $d_{m,s}$ is the region description, and $v_{m,s}$ is the extracted visual information\.

We encode each evidence step into the same semantic space used by the semantic alignment decoder:

$$e_{m,s}=g\left(d_{m,s}\oplus v_{m,s}\right),$$

where $g(\cdot)$ denotes the frozen semantic encoder used to build stable region\-level semantic anchors, and $\oplus$ denotes textual concatenation\. The resulting set

$$\mathcal{B}(I,q)=\{e_{m,s}\}_{m=1,s=1}^{M,S_{m}}$$

serves as an image\-question\-specific bank of answer\-relevant visual evidence\.

### D\.2Semantic Region Similarity as Reasoning Entropy

#### Probing latent token states in the semantic space\.

For a latent visual reasoning model, we collect the last\-layer hidden states along the latent trajectory, including both supervised and unsupervised latent tokens\. For a hidden state $h_t$, we use the calibrated semantic alignment decoder as a probing interface:

$$d_t=f_{\mathrm{desc}}(h_t).$$

We compute the similarity between $d_t$ and each evidence step in $\mathcal{B}(I,q)$, and normalize the similarities into an evidence\-step distribution:

$$p_{m,s}^{(t)}=\frac{\exp\left(\cos(d_t,e_{m,s})/\tau\right)}{\sum_{m'=1}^{M}\sum_{s'=1}^{S_{m'}}\exp\left(\cos(d_t,e_{m',s'})/\tau\right)},$$

where $\tau$ is a temperature parameter\.

#### Normalized reasoning entropy\.

Because our goal is to measure whether a latent state remains compatible with multiple plausible grounded reasoning traces, we aggregate the evidence\-step distribution into a trace\-level distribution:

$$P_{m}^{(t)}=\sum_{s=1}^{S_{m}}p_{m,s}^{(t)}.$$

The reasoning entropy of token $t$ is then defined as

$$H_{\mathrm{reason}}(h_t)=-\sum_{m=1}^{M}P_{m}^{(t)}\log P_{m}^{(t)}.$$

To make values comparable across samples with different numbers of retained valid traces, we report the normalized reasoning entropy:

$$\widetilde{H}_{\mathrm{reason}}(h_t)=\frac{H_{\mathrm{reason}}(h_t)}{\log M}.$$

Thus, $\widetilde{H}_{\mathrm{reason}}(h_t)\in[0,1]$\. A higher value indicates that the hidden state remains semantically compatible with multiple valid grounded reasoning traces, while a lower value indicates that the state has concentrated on a smaller set of reasoning possibilities\.
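The semantic probing estimator above can be sketched in a few lines of plain Python\. This is a minimal illustration under stated assumptions, not the authors' implementation: the decoded probe $d_t$ and the evidence embeddings $e_{m,s}$ are given as plain vectors, the temperature default `tau=0.1` is an arbitrary placeholder \(the paper does not report its value here\), and the function name is hypothetical\.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def normalized_reasoning_entropy(d_t, bank, tau=0.1):
    """Normalized trace-level entropy of a probed latent state.

    d_t:  probe vector f_desc(h_t).
    bank: list of M traces, each a list of evidence embeddings e_{m,s}.
    tau:  softmax temperature (placeholder value, an assumption).
    """
    logits = [[cosine(d_t, e) / tau for e in trace] for trace in bank]
    z_max = max(z for trace in logits for z in trace)  # for numerical stability
    weights = [[math.exp(z - z_max) for z in trace] for trace in logits]
    total = sum(w for trace in weights for w in trace)
    # Trace-level distribution P_m: sum of step probabilities within each trace.
    trace_probs = [sum(trace) / total for trace in weights]
    entropy = -sum(p * math.log(p) for p in trace_probs if p > 0.0)
    num_traces = len(bank)
    return entropy / math.log(num_traces) if num_traces > 1 else 0.0


# Toy bank with two single-step traces (illustrative vectors only).
ambiguous = normalized_reasoning_entropy([1.0, 0.0], [[[1.0, 0.0]], [[1.0, 0.0]]])
committed = normalized_reasoning_entropy([1.0, 0.0], [[[1.0, 0.0]], [[-1.0, 0.0]]])
print(ambiguous, committed)
```

A probe that is equally close to both traces yields a normalized entropy of 1, while a probe aligned with only one trace yields a value near 0\. Because $P_m^{(t)}$ sums over steps, traces with more evidence steps receive more probability mass for the same per\-step similarity\.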

This probing analysis should not be interpreted as decoding multiple bounding boxes from one latent token\. The visual grounding decoder $f_{\mathrm{reg}}$ still predicts a single supervised bounding box for each grounded latent token\. The entropy is instead estimated through the semantic alignment decoder $f_{\mathrm{desc}}$, which probes whether the hidden state is semantically close to multiple valid grounded reasoning traces beyond its explicitly decoded spatial output\.

### D\.3Intervention\-based Visual Evidence Importance as Reasoning Entropy

The semantic probing entropy above measures whether a hidden state is compatible with multiple grounded reasoning traces in the learned semantic evidence space\. Although more intuitive, this probing interface relies on the calibrated semantic decoder $f_{\mathrm{desc}}$, which is trained on latent states and is therefore not directly comparable for ordinary language states\. Another possible choice is to compute entropy from the language\-modeling distribution, i\.e\.,

$$H_{\mathrm{vocab}}(t)=-\sum_{v\in\mathcal{V}}p_{\theta}(v\mid u_t)\log p_{\theta}(v\mid u_t),$$

where $p_{\theta}(v\mid u_t)$ is obtained by applying the LM head and softmax to the hidden state\. However, this quantity measures next\-token prediction uncertainty over the vocabulary rather than the amount of visual evidence represented by the current state\. For example, a transition token may have low vocabulary entropy simply because the next word is syntactically predictable, even if its hidden state still depends on multiple visual regions\. Conversely, function words or ambiguous lexical continuations may yield high vocabulary entropy without indicating broad visual grounding\. Moreover, vocabulary entropy is not naturally defined for non\-linguistic latent slots, making it unsuitable for comparing latent tokens and language tokens in a shared representational space\.

To fairly compare latent tokens, transition tokens, and answer tokens, we therefore estimate entropy through an intervention\-based visual evidence distribution, which does not require any additional projection head or vocabulary\-level decoding\. This estimate evaluates all analyzed states in the same image\-question\-specific evidence space by measuring how their representations change when each grounded visual evidence region is removed\.

For each image–question pair\(I,q\)\(I,q\), we use the grounded evidence bankℬ​\(I,q\)\\mathcal\{B\}\(I,q\)defined above\. Each evidence node corresponds to a verified regionbkb\_\{k\}from the retained grounded traces, wherek∈\{1,…,KI,q\}k\\in\\\{1,\\ldots,K\_\{I,q\}\\\}\. To avoid artificially increasing entropy due to repeated boxes across different traces, we merge highly overlapping evidence regions using non\-maximum suppression and keep the merged regions as the intervention units\. Letutu\_\{t\}denote the last\-layer hidden state at trajectory positiontt, whereutu\_\{t\}can be a supervised latent token, an unsupervised latent token, or to generate a transition token or an answer token\. The model first inferences on the original image and record the full\-state representationutfullu\_\{t\}^\{\\mathrm\{full\}\}\.

We then construct one counterfactual input for each evidence node by masking the corresponding image region $b_k$, producing $I^{(-k)}$. The same question, latent-token layout, transition tokens, and answer tokens are used under teacher forcing, so that token positions are aligned across the original and counterfactual forward passes. This yields a counterfactual representation $u_t^{(-k)}$ for every analyzed state, and the difference between $u_t^{\mathrm{full}}$ and $u_t^{(-k)}$ reflects the effect of removing visual evidence region $b_k$, rather than changes in the generated token sequence. The visual sensitivity of state $u_t$ to evidence node $k$ is defined as
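
Constructing the counterfactual inputs $I^{(-k)}$ might look like the sketch below. The fill value, the pixel-box convention (exclusive upper bounds), and the nested-list image representation are assumptions made for illustration; the text does not specify how the masked region is filled.

```python
def mask_region(image, box, fill=0):
    """Return a copy of `image` (a list of pixel rows) with `box` set to `fill`.

    `box` is (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    masked = [row[:] for row in image]  # copy so the original stays intact
    for y in range(max(0, y1), min(len(masked), y2)):
        for x in range(max(0, x1), min(len(masked[y]), x2)):
            masked[y][x] = fill
    return masked

# Toy single-channel image, 6 pixels wide and 4 pixels high.
image = [[1] * 6 for _ in range(4)]
evidence_boxes = [(0, 0, 2, 2), (3, 1, 6, 4)]

# One counterfactual image per evidence node; each would then be passed
# through the model with the same teacher-forced token sequence.
counterfactuals = [mask_region(image, b) for b in evidence_boxes]
```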

$$s_{t,k} = \max\left(0,\, 1 - \cos\left(\mathrm{LN}(u_t^{\mathrm{full}}), \mathrm{LN}(u_t^{(-k)})\right)\right),$$

where $\mathrm{LN}(\cdot)$ denotes the same final hidden-state normalization used before the language modeling head. A larger $s_{t,k}$ indicates that removing region $b_k$ induces a larger change in the token state, suggesting that the state depends more strongly on this visual evidence node.
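
A small sketch of the sensitivity computation follows. It assumes a parameter-free layer normalization as a stand-in for the model's final normalization (which in a real model may carry learned weights or be an RMSNorm variant), and the hidden states are toy vectors rather than model outputs.

```python
import math

def layer_norm(u, eps=1e-5):
    """Parameter-free layer normalization (learned scale and bias omitted)."""
    mean = sum(u) / len(u)
    var = sum((x - mean) ** 2 for x in u) / len(u)
    return [(x - mean) / math.sqrt(var + eps) for x in u]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # degenerate state: treated as "no change" (assumption)
    return dot / (norm_a * norm_b)

def sensitivity(u_full, u_masked):
    """s_{t,k} = max(0, 1 - cos(LN(u_full), LN(u_masked)))."""
    return max(0.0, 1.0 - cosine(layer_norm(u_full), layer_norm(u_masked)))

u_full = [1.0, 2.0, 3.0, 4.0]      # toy state on the original image
u_rescaled = [2.0, 4.0, 6.0, 8.0]  # same direction after normalization
u_flipped = [4.0, 3.0, 2.0, 1.0]   # opposite direction after normalization

s_rescaled = sensitivity(u_full, u_rescaled)
s_flipped = sensitivity(u_full, u_flipped)
```

Because cosine similarity lies in $[-1, 1]$, the sensitivity lies in $[0, 2]$; the outer $\max(0, \cdot)$ only guards against rounding that pushes the cosine marginally above 1.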

We normalize the sensitivities over all evidence nodes to obtain an interventional visual evidence distribution:

$$q_{t,k} = \frac{s_{t,k} + \epsilon}{\sum_{r=1}^{K_{I,q}} (s_{t,r} + \epsilon)},$$

where $\epsilon$ is a small constant for numerical stability. The corresponding evidence entropy is

$$H_{\mathrm{IVE}}(u_t) = -\sum_{k=1}^{K_{I,q}} q_{t,k} \log q_{t,k}.$$

Since different samples may contain different numbers of retained evidence nodes, we report normalized entropy:

$$\widetilde{H}_{\mathrm{IVE}}(u_t) = \frac{H_{\mathrm{IVE}}(u_t)}{\log K_{I,q}}.$$
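
The normalization and entropy steps can be sketched as follows. The value of `eps` and the handling of the single-node case (where $\log K_{I,q} = 0$) are assumptions, since the text gives neither; the sensitivity vectors are toy values.

```python
import math

def evidence_distribution(s, eps=1e-8):
    """q_{t,k} = (s_{t,k} + eps) / sum_r (s_{t,r} + eps)."""
    total = sum(x + eps for x in s)
    return [(x + eps) / total for x in s]

def normalized_ive_entropy(s, eps=1e-8):
    """Evidence entropy divided by log K, so the result lies in [0, 1]."""
    k = len(s)
    if k < 2:
        return 0.0  # log K = 0 for a single node; convention assumed here
    q = evidence_distribution(s, eps)
    h = -sum(p * math.log(p) for p in q)
    return h / math.log(k)

spread = [0.3, 0.3, 0.3]   # state equally sensitive to three regions
focused = [0.9, 0.0, 0.0]  # state sensitive to a single region
tiny = [1e-6, 1e-6, 1e-6]  # barely sensitive to anything, but uniformly so

h_spread = normalized_ive_entropy(spread)
h_focused = normalized_ive_entropy(focused)
h_tiny = normalized_ive_entropy(tiny)
```

In this toy setting `h_spread` and `h_tiny` are both close to 1 even though the second state barely responds to any region, because the normalization discards the overall magnitude of the sensitivities.
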
A potential issue is that visually inactive tokens, such as function words or punctuation, may have uniformly small sensitivities to all evidence regions\. Such tokens can obtain spuriously high normalized entropy after normalization\. We therefore compute the total visual sensitivity mass

$$M_t = \sum_{k=1}^{K_{I,q}} s_{t,k},$$

and use a mass-aware entropy score:

$$\widehat{H}_{\mathrm{IVE}}(u_t) = \frac{M_t}{M_t + \alpha} \cdot \widetilde{H}_{\mathrm{IVE}}(u_t),$$

where $\alpha$ is set to the median visual sensitivity mass over all analyzed states on the validation subset. This weighting preserves high entropy only when the token state is both visually grounded and broadly influenced by multiple evidence nodes.
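
A toy illustration of the mass-aware score is given below. The three sensitivity vectors and their labels are invented, `eps` is an assumed constant, and $\alpha$ is taken as the median mass over just these three states rather than over a validation subset.

```python
import math
from statistics import median

def normalized_entropy(s, eps=1e-8):
    """Normalized evidence entropy of a sensitivity vector with K >= 2 entries."""
    total = sum(x + eps for x in s)
    q = [(x + eps) / total for x in s]
    return -sum(p * math.log(p) for p in q) / math.log(len(s))

def mass_aware_entropy(s, alpha, eps=1e-8):
    """Scale normalized entropy by M_t / (M_t + alpha), with M_t = sum_k s_{t,k}."""
    mass = sum(s)
    return mass / (mass + alpha) * normalized_entropy(s, eps)

# Hypothetical sensitivities of three states to three evidence nodes.
states = {
    "latent": [0.30, 0.25, 0.35],       # grounded and spread over regions
    "punctuation": [1e-4, 1e-4, 1e-4],  # visually inactive
    "answer": [0.60, 0.02, 0.01],       # grounded but concentrated
}
alpha = median(sum(s) for s in states.values())
scores = {name: mass_aware_entropy(s, alpha) for name, s in states.items()}
```

In this example the punctuation state has a normalized entropy near 1 but a mass-aware score near 0, while the latent state keeps a high score because it is both visually sensitive and spread across regions.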

#### Trajectory\-level aggregation\.

We compute $\widehat{H}_{\mathrm{IVE}}$ for all states along the latent-to-answer trajectory. Fixed latent slots are averaged by slot index across samples. Since the number of transition and answer tokens may vary across examples, we align textual positions by their normalized phase position and average them into fixed-width bins. Special tokens and padding tokens are excluded. The plotted curve reports the mean entropy over samples, with the latent phase, transition-token phase, and answer-token phase shown in order.
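
The alignment of variable-length text phases can be sketched as below. Several details are assumptions, since the text does not fix them: token $i$ of an $n$-token phase is placed at position $(i + 0.5)/n$, bins are averaged within each sample first and then across samples, and the number of bins is arbitrary.

```python
def bin_sample(scores, num_bins):
    """Mean score per fixed-width bin for one sample; None for empty bins."""
    buckets = [[] for _ in range(num_bins)]
    n = len(scores)
    for i, value in enumerate(scores):
        pos = (i + 0.5) / n  # normalized phase position in (0, 1)
        buckets[min(int(pos * num_bins), num_bins - 1)].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]

def aggregate(samples, num_bins):
    """Average per-sample bin means across samples, skipping empty bins."""
    per_sample = [bin_sample(s, num_bins) for s in samples]
    curve = []
    for b in range(num_bins):
        vals = [row[b] for row in per_sample if row[b] is not None]
        curve.append(sum(vals) / len(vals) if vals else None)
    return curve

# Hypothetical mass-aware entropy scores for the transition phase.
transition_scores = [
    [0.8, 0.6],            # sample with 2 transition tokens
    [0.9, 0.7, 0.5, 0.3],  # sample with 4 transition tokens
]
curve = aggregate(transition_scores, num_bins=2)
```

Averaging within each sample first gives every sample equal weight in a bin, regardless of how many tokens it contributes there.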

This intervention\-based estimate places latent tokens and language tokens in the same evidence space: both are evaluated by how their hidden states causally respond to removing each answer\-relevant visual region\. Therefore, the entropy gap should be interpreted as evidence\-dispersion rather than next\-token uncertainty\. High values indicate that a state remains sensitive to multiple grounded visual evidence nodes, while low values indicate that the state has concentrated on a narrower set of evidence required for final vocabulary\-aligned decoding\.

## Appendix E Dataset Statistics and Visualization

Tables [6](https://arxiv.org/html/2605.07106#A5.T6)–[9](https://arxiv.org/html/2605.07106#A5.T9) summarize the overall scale, reasoning-step distribution, JSONL sample schema, and per-step annotation format of GLSD. Figures [6](https://arxiv.org/html/2605.07106#A5.F6)–[8](https://arxiv.org/html/2605.07106#A5.F8) visualize example GLSD samples.

Table 6: Statistics of the Grounded Latent Supervision Dataset (GLSD).

| Item | Value |
| --- | --- |
| Source dataset | GQA train split |
| Storage format | JSONL |
| Parseable samples | 96,000 |
| Reasoning steps per sample | 2–4 |
| Total grounded reasoning steps | 256,444 |
| Average reasoning steps | 2.67 |
| Spatial supervision | Normalized and pixel-level bounding boxes |
| Semantic supervision | Region descriptions and visual information |
| Answer fields | full answer and final answer |

Table 7: Distribution of reasoning-chain lengths in GLSD.

| Reasoning-chain length | Number of samples | Percentage |
| --- | --- | --- |
| 2 steps | 47,410 | 49.39% |
| 3 steps | 32,736 | 34.10% |
| 4 steps | 15,854 | 16.51% |
| Total | 96,000 | 100.00% |

Table 8: Top-level fields in each GLSD JSONL sample.

| Field | Type | Description |
| --- | --- | --- |
| `question` | string | Question text |
| `answer` | string | final answer phrase |
| `full_answer` | string | elaborated answer (transition) |
| `image` | string | GQA image filename |
| `width`, `height` | int | Image resolution |
| `dataset`, `split` | string | Source dataset and split |
| `reasoning_chain` | array | Step-wise grounded reasoning trace |
| `annotation_mask` | array[int] | Valid-step mask padded to budget $K$ |
| `K` | int | Latent-slot budget in the stored sample |
| `reasoning_chain_viz_file` | string | Optional visualization filename |

Table 9: Fields of each reasoning step in `reasoning_chain`.

| Field | Type | Description |
| --- | --- | --- |
| `step` | int | Step index starting from 1 |
| `operation` | string | Step type, e.g., `locate`, `inspect`, `verify` |
| `bbox_01` | array[float] | Normalized box $[x_1, y_1, x_2, y_2]$ in $[0, 1]$ |
| `bbox_pixels` | array[int] | Pixel-space bounding box |
| `region_description` | string | Description of the attended region |
| `visual_information` | string | Visual evidence extracted from the region |

![Refer to caption](https://arxiv.org/html/2605.07106v1/x6.png)

Figure 6: Example of 4-steps Supervision Sample.

![Refer to caption](https://arxiv.org/html/2605.07106v1/x7.png)

Figure 7: Example of 3-steps Supervision Sample.

![Refer to caption](https://arxiv.org/html/2605.07106v1/x8.png)

Figure 8: Example of 2-steps Supervision Sample.
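
To make the schema in Tables 8 and 9 concrete, the sketch below builds one JSONL line and runs a few consistency checks on it. The record is invented for illustration and is not taken from GLSD; the checker `check_sample`, the mask encoding (1 for a valid step, 0 for padding), and the specific checks are assumptions rather than part of the dataset's tooling.

```python
import json

STEP_FIELDS = {"step", "operation", "bbox_01", "bbox_pixels",
               "region_description", "visual_information"}

def check_sample(sample):
    """Return a list of schema problems for one GLSD-style record."""
    problems = []
    chain = sample["reasoning_chain"]
    mask = sample["annotation_mask"]
    if len(mask) != sample["K"]:
        problems.append("annotation_mask length differs from K")
    if sum(mask) != len(chain):  # assumes 1 = valid step, 0 = padding
        problems.append("annotation_mask does not match the number of steps")
    for i, step in enumerate(chain, start=1):
        missing = STEP_FIELDS - set(step)
        if missing:
            problems.append(f"step {i}: missing fields {sorted(missing)}")
            continue
        if step["step"] != i:
            problems.append(f"step {i}: index is {step['step']}")
        x1, y1, x2, y2 = step["bbox_01"]
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            problems.append(f"step {i}: bbox_01 outside [0, 1] or degenerate")
    return problems

# Invented record for illustration only.
line = json.dumps({
    "question": "What color is the mug on the desk?",
    "answer": "blue",
    "full_answer": "The mug on the desk is blue.",
    "image": "example.jpg",
    "width": 640, "height": 480,
    "dataset": "GQA", "split": "train",
    "reasoning_chain": [
        {"step": 1, "operation": "locate",
         "bbox_01": [0.10, 0.40, 0.90, 0.95],
         "bbox_pixels": [64, 192, 576, 456],
         "region_description": "wooden desk",
         "visual_information": "a desk with several objects on it"},
        {"step": 2, "operation": "inspect",
         "bbox_01": [0.45, 0.50, 0.60, 0.70],
         "bbox_pixels": [288, 240, 384, 336],
         "region_description": "mug on the desk",
         "visual_information": "the mug is blue"},
    ],
    "annotation_mask": [1, 1, 0, 0],
    "K": 4,
})
sample = json.loads(line)
problems = check_sample(sample)
```

Here the two-step chain fills two of the four latent slots, so the mask has two valid entries followed by two padding entries.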
