Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Hugging Face Daily Papers

Summary

Introduces Perceive-to-Reason (P2R), a framework that decouples visual perception from reasoning in vision-language models using a two-stage process and a role-aware reinforcement learning strategy, achieving state-of-the-art results on fine-grained visual reasoning benchmarks.

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
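The two-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `perceiver` and `reasoner` callables, the box format, and the toy list-of-rows image are all assumptions made for illustration. The sketch only shows how localized boxes could be turned into an annotated image plus cropped regions before the answer is produced.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom); right/bottom exclusive (assumed format)
Image = List[List[int]]           # toy grayscale image: rows of pixel values


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so cropping never indexes out of range."""
    left, top, right, bottom = box
    left, right = max(0, min(left, width)), max(0, min(right, width))
    top, bottom = max(0, min(top, height)), max(0, min(bottom, height))
    return (left, top, max(left, right), max(top, bottom))


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def annotate(image: Image, boxes: Sequence[Box], marker: int = 255) -> Image:
    """Return a copy of the image with each box outline drawn in `marker`."""
    out = [row[:] for row in image]
    for left, top, right, bottom in boxes:
        for x in range(left, right):
            out[top][x] = marker
            out[bottom - 1][x] = marker
        for y in range(top, bottom):
            out[y][left] = marker
            out[y][right - 1] = marker
    return out


def perceive_then_reason(
    image: Image,
    question: str,
    perceiver: Callable[[Image, str], Sequence[Box]],      # hypothetical stage-1 model
    reasoner: Callable[[Image, List[Image], str], str],    # hypothetical stage-2 model
) -> str:
    """Stage 1 localizes evidence; stage 2 answers from the annotated image and crops."""
    height = len(image)
    width = len(image[0]) if height else 0
    boxes = [clamp_box(b, width, height) for b in perceiver(image, question)]
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]  # drop empty boxes
    annotated = annotate(image, boxes)
    crops = [crop(image, b) for b in boxes]
    return reasoner(annotated, crops, question)
```

In a real system both callables would be calls to the same vision-language model under different role prompts; here they are left abstract so the data flow between the two stages stays visible.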

Cached at: 07/02/26, 03:46 AM


Source: https://huggingface.co/papers/2607.01191

Abstract

A unified framework named Perceive-to-Reason (P2R) is introduced that separates visual perception from reasoning in vision-language models through a two-stage process, improving fine-grained visual reasoning performance on high-resolution images.

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
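The training idea in the abstract, alternating role-specific updates driven only by final-answer correctness, can be sketched as follows. The group-relative advantage is the standard GRPO normalization (reward minus group mean, divided by group standard deviation). Everything else is an assumption for illustration: the abstract does not specify the alternation schedule, the reward definition, or how answers are matched, so `role_for_step`, its `period` parameter, and `final_answer_reward` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step: int, period: int = 1) -> str:
    """Assumed schedule: switch the updated role every `period` steps, Perceiver first."""
    return "perceiver" if (step // period) % 2 == 0 else "reasoner"


def final_answer_reward(prediction: str, gold: str) -> float:
    """Assumed reward: 1.0 on a case-insensitive exact match of the final answer, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def advantages_for_group(predictions: List[str], gold: str) -> List[float]:
    """Score a group of sampled rollouts by final answer only, then normalize."""
    rewards = [final_answer_reward(p, gold) for p in predictions]
    return group_relative_advantages(rewards)
```

Under this sketch, a training loop would call `role_for_step(step)` to decide which role's sampled outputs receive the policy-gradient update at that step, while both roles share the same final-answer reward signal. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient.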


Get this paper in your agent:

hf papers read 2607.01191

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 3

#### hongxingli/P2R-4B
Image-Text-to-Text • 5B • Updated 43 minutes ago

#### hongxingli/P2R-2B
Image-Text-to-Text • 2B • Updated 43 minutes ago

#### hongxingli/P2R-8B
Image-Text-to-Text • 9B • Updated 42 minutes ago

Datasets citing this paper: 1

#### hongxingli/P2R-10k
Updated 41 minutes ago • 10k • 9


Similar Articles

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.