xi-an-jiaotong-university

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

arXiv cs.CL · 2026-05-11

This paper introduces RIS (Retrieve, Integrate, and Synthesize), a framework for spatial-semantic grounded latent visual reasoning that aims to overcome information bottlenecks in Multimodal Large Language Models. It anchors latent tokens to spatial and semantic evidence and reports improvements on benchmarks such as V* and HRBench.
