Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning
Summary
Introduces Perceive-to-Reason (P2R), a framework that decouples visual perception from reasoning in vision-language models using a two-stage process and a role-aware reinforcement learning strategy, achieving state-of-the-art results on fine-grained visual reasoning benchmarks.
Cached at: 07/02/26, 03:46 AM
Source: https://huggingface.co/papers/2607.01191
Abstract
A unified framework named Perceive-to-Reason (P2R) is introduced that separates visual perception from reasoning in vision-language models through a two-stage process, improving fine-grained visual reasoning performance on high-resolution images.
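The two-stage process can be pictured with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes the perception stage returns bounding boxes and the reasoning stage receives the image, the boxes (standing in for an annotated image), and the cropped regions, as the abstract describes. The names `Box`, `crop`, `perceive_then_reason`, `toy_perceiver`, and `toy_reasoner` are hypothetical, and the toy stand-ins replace what would be vision-language model calls.

```python
from typing import Callable, List, Tuple

# Hypothetical conventions for this sketch (not taken from the paper):
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # toy grayscale image as rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def perceive_then_reason(
    image: Image,
    question: str,
    perceiver: Callable[[Image, str], List[Box]],
    reasoner: Callable[[Image, List[Box], List[Image], str], str],
) -> str:
    """Stage 1 localizes evidence; stage 2 answers from image, boxes, and crops."""
    boxes = perceiver(image, question)
    crops = [crop(image, box) for box in boxes]
    return reasoner(image, boxes, crops, question)


def toy_perceiver(image: Image, question: str) -> List[Box]:
    """Stand-in for a model call: box the brightest pixel with a one-pixel margin."""
    _, x, y = max((v, x, y) for y, row in enumerate(image) for x, v in enumerate(row))
    return [(x - 1, y - 1, x + 2, y + 2)]


def toy_reasoner(image: Image, boxes: List[Box], crops: List[Image], question: str) -> str:
    """Stand-in for a model call: report the largest value seen in the crops."""
    return str(max(v for region in crops for row in region for v in row))
```

In a real system both callables would wrap the same vision-language model prompted in two different roles; the sketch only shows how the data flows between the stages.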
Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
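The alternating, final-answer-only training signal can be sketched as follows. This is a rough illustration under stated assumptions, not the PRA-GRPO algorithm itself: the abstract does not give the alternation schedule or reward details, so the fixed-period schedule in `role_for_step` and the exact-match reward in `final_answer_reward` are hypothetical. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
import statistics
from typing import List, Sequence

ROLES = ("perceiver", "reasoner")  # role names follow the abstract


def final_answer_reward(predicted: str, gold: str) -> float:
    """Assumed reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def role_for_step(step: int, period: int = 1) -> str:
    """Assumed schedule: switch the updated role every `period` training steps."""
    return ROLES[(step // period) % len(ROLES)]


# Usage: four sampled answers to one question, scored only on the final answer.
answers = ["a red cup", "a blue cup", "A red cup", "no cup"]
rewards = [final_answer_reward(a, "a red cup") for a in answers]
advantages = group_relative_advantages(rewards)
schedule = [role_for_step(step) for step in range(4)]
```

Under this sketch, the same final-answer advantages would be applied to the Perceiver's tokens on perception-focused steps and to the Reasoner's tokens on reasoning-focused steps; how the paper actually assigns credit between roles is not stated in the abstract.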
Models citing this paper: 3
#### hongxingli/P2R-4B Image-Text-to-Text • 5B • Updated 43 minutes ago
#### hongxingli/P2R-2B Image-Text-to-Text • 2B • Updated 43 minutes ago
#### hongxingli/P2R-8B Image-Text-to-Text • 9B • Updated 42 minutes ago
Datasets citing this paper: 1
#### hongxingli/P2R-10k • Updated 41 minutes ago • 10k • 9
Similar Articles
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
This paper proposes a staged training approach for vision-language models that separates visual perception, visual reasoning, and textual reasoning into distinct stages. The method improves visual reasoning accuracy while reducing reasoning trace length, demonstrating that stronger perception reduces the need for excessive reasoning.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking
PixelEyes proposes a multi-turn visual reasoning agent that decouples perception and reasoning using mask-guided search and semantic-region breadth-first search, introducing a new benchmark (Pinpoint-Bench) and dataset (PixelEyes-6K) to improve localization in visual evidence seeking.
Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.