Thinking with Visual Grounding
Summary
This paper introduces visually grounded thinking, a method in which vision-language models interleave natural-language reasoning with explicit grounding of the visual evidence as points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve a 4B model on counting and spatial reasoning benchmarks; on spatial reasoning, the 4B model matches, and in some cases surpasses, a 27B model from the same family.
Source: https://huggingface.co/papers/2606.16122
Abstract
Visually grounded thinking integrates natural-language reasoning with explicit visual evidence grounding in vision-language models, improving reasoning accuracy through scalable synthesis and reinforcement learning techniques.
Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
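The abstract names two mechanisms without giving formulas: deriving point and box supervision from segmentation masks, and combining an answer-correctness reward with dense grounding rewards. The Python sketch below illustrates one plausible reading of both. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the tight bounding box, the choice of the foreground pixel nearest the centroid as the point, box IoU and point-in-mask as grounding scores, and the additive reward with a weight of 0.5.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # binary mask, mask[y][x] in {0, 1}
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixels
Point = Tuple[int, int]           # (x, y)


def _foreground(mask: Mask) -> List[Point]:
    """All (x, y) pixels where the mask is set."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def box_from_mask(mask: Mask) -> Optional[Box]:
    """Tightest axis-aligned box around the mask, or None if the mask is empty."""
    coords = _foreground(mask)
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def point_from_mask(mask: Mask) -> Optional[Point]:
    """Foreground pixel nearest the centroid, so the point always lies on the object."""
    coords = _foreground(mask)
    if not coords:
        return None
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union for inclusive pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def point_hit(point: Point, mask: Mask) -> float:
    """1.0 if the predicted point falls on the reference mask, else 0.0."""
    x, y = point
    inside = 0 <= y < len(mask) and 0 <= x < len(mask[y])
    return 1.0 if inside and mask[y][x] else 0.0


def total_reward(answer_correct: bool, grounding_scores: List[float],
                 weight: float = 0.5) -> float:
    """Answer reward plus a weighted mean of per-reference grounding scores.

    The additive form and the default weight are illustrative assumptions.
    """
    dense = sum(grounding_scores) / len(grounding_scores) if grounding_scores else 0.0
    return float(answer_correct) + weight * dense
```

In this sketch, a rollout that answers correctly but grounds only one of two referenced objects scores lower than one that grounds both, which is the kind of signal a dense grounding reward is meant to add on top of answer correctness.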
Datasets citing this paper: JunkaiZ/TVG
Similar Articles
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.