Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Summary
Faithful-MR1 is a training framework for faithful multimodal reasoning in MLLMs. It anchors visual attention by supervising a dedicated <Focus> token directly against image regions, and it reinforces faithful use of visual evidence through counterfactual image intervention. It outperforms recent multimodal reasoning baselines on Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
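The two ideas in the summary can be illustrated with a toy sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names (`focus_attention_loss`, `causal_patches`, `faithful_use_reward`), the negative-log attention-mass loss, and the patch-masking form of intervention are hypothetical and are not the paper's actual objective or reward.

```python
import math


def focus_attention_loss(attn, region_mask, eps=1e-8):
    """Hypothetical anchoring loss for a <Focus>-style token.

    attn: non-negative attention weights over image patches.
    region_mask: 1 for task-relevant patches, 0 otherwise.
    Returns the negative log of the attention mass inside the region,
    so the loss shrinks as attention concentrates on relevant patches.
    """
    total = sum(attn)
    inside = sum(a for a, m in zip(attn, region_mask) if m)
    return -math.log(inside / (total + eps) + eps)


def causal_patches(predict, patches, blank):
    """Hypothetical counterfactual intervention.

    Replaces one patch at a time with `blank` and marks the patch with 1
    if the prediction changes, i.e. the patch causally matters here.
    """
    original = predict(patches)
    mask = []
    for i in range(len(patches)):
        intervened = list(patches)
        intervened[i] = blank
        mask.append(1 if predict(intervened) != original else 0)
    return mask


def faithful_use_reward(answer_correct, attn, causal_mask, eps=1e-8):
    """Hypothetical reward: zero for wrong answers; otherwise the share of
    attention mass that lands on causally relevant patches."""
    if not answer_correct:
        return 0.0
    total = sum(attn)
    inside = sum(a for a, m in zip(attn, causal_mask) if m)
    return inside / (total + eps)


# Toy usage: a stand-in "model" whose answer depends only on patches 1 and 2.
toy_predict = lambda p: p[1] + p[2]
toy_patches = [5, 3, 4, 9]
toy_attn = [0.1, 0.6, 0.2, 0.1]

toy_mask = causal_patches(toy_predict, toy_patches, blank=0)
toy_loss = focus_attention_loss(toy_attn, toy_mask)
toy_reward = faithful_use_reward(True, toy_attn, toy_mask)
```

In this toy setup the intervention marks only the two patches that the stand-in model actually reads, and both the loss and the reward depend on how much attention falls on them. A real implementation would operate on model attention maps and image regions rather than Python lists.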
# Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Source: [https://arxiv.org/abs/2605.22072](https://arxiv.org/abs/2605.22072) [View PDF](https://arxiv.org/pdf/2605.22072)

> Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated `<Focus>` token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

## Submission history

From: Changyuan Tian

**[v1]** Thu, 21 May 2026 07:10:18 UTC (7,377 KB)
Similar Articles
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
AtManRL is a method that uses differentiable attention manipulation and reinforcement learning to train LLMs to generate more faithful chain-of-thought reasoning by ensuring reasoning tokens causally influence final predictions. Experiments on GSM8K and MMLU with Llama-3.2-3B demonstrate the approach can identify influential reasoning tokens and improve reasoning transparency.
MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning
This paper introduces MathVis-Fine, a framework for fine-grained visual dependency modeling in multimodal mathematical reasoning, along with a new dataset and a two-stage progressive training paradigm that balances answer correctness and visual grounding rewards based on each sample's intrinsic visual dependency level.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
Reinforcing Multimodal Reasoning Against Visual Degradation
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.