Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

arXiv cs.CL 05/22/26, 04:00 AM Papers

Summary

Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.

arXiv:2605.22072v1

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
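
As a rough illustration of the two stages described in the abstract, the following Python sketch shows one way such training signals could be computed. It is not the authors' implementation. The function names, the negative-log attention-mass loss, the use of a log-probability drop under a counterfactual image as the causal signal, and all inputs are assumptions made for illustration; the abstract does not specify these details.

```python
import math


def anchoring_loss(focus_attention, region_mask, eps=1e-8):
    """Hypothetical anchoring objective (an assumption, not the paper's loss).

    focus_attention: attention weights from the Focus token to each image patch.
    region_mask: 1 for patches inside the task-relevant region, else 0.
    Returns the negative log of the share of attention that falls in the region,
    so the loss shrinks as attention concentrates on the annotated patches.
    """
    total = sum(focus_attention)
    in_region = sum(a for a, m in zip(focus_attention, region_mask) if m)
    return -math.log(in_region / (total + eps) + eps)


def faithful_use_reward(answer_correct, visual_attention,
                        logp_original, logp_counterfactual):
    """Hypothetical faithful-use reward (an assumption, not the paper's reward).

    visual_attention: per reasoning step, the share of attention on image tokens.
    logp_original / logp_counterfactual: per-step log-probabilities of the
    generated step given the real image and a counterfactual (altered) image.
    Only answer-correct trajectories are rewarded. Each step's visual attention
    is weighted by how much the real image raised that step's log-probability.
    """
    if not answer_correct:
        return 0.0
    # Causal effect of vision per step, clipped at zero.
    effects = [max(lo - lc, 0.0)
               for lo, lc in zip(logp_original, logp_counterfactual)]
    total_effect = sum(effects)
    if total_effect == 0.0:
        return 0.0  # vision made no measurable difference anywhere
    return sum(a * e / total_effect
               for a, e in zip(visual_attention, effects))
```

For example, `anchoring_loss([0.1, 0.7, 0.2], [0, 1, 0])` is approximately 0.357, the negative log of the 0.7 attention mass placed on the marked patch. Under these assumptions, a trajectory earns a higher reward when its visual attention is high at exactly the steps where swapping the image changes the model's prediction.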

# Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Source: [https://arxiv.org/abs/2605.22072](https://arxiv.org/abs/2605.22072)
[View PDF](https://arxiv.org/pdf/2605.22072)

## Submission history

From: Changyuan Tian

**[v1]** Thu, 21 May 2026 07:10:18 UTC (7,377 KB)

Similar Articles

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Reinforcing Multimodal Reasoning Against Visual Degradation
