This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models (MLLMs), designed to overcome information bottlenecks. It anchors latent tokens to spatial and semantic evidence and reports improvements on benchmarks such as V* and HRBench.