Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

arXiv cs.CL Papers

Summary

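Faithful-MR1 is a training framework for faithful multimodal reasoning in multimodal large language models (MLLMs). It anchors visual attention by supervising a dedicated <Focus> token directly against image regions, and it reinforces faithful use of visual evidence through counterfactual image intervention. It outperforms recent multimodal reasoning baselines on the Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.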

arXiv:2605.22072v1 Announce Type: new

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
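The abstract does not spell out the loss or reward formulas, so the following is only a minimal sketch of how the two stages could be expressed, not the authors' implementation. It assumes the <Focus> token's attention over image patches is available as a list of weights, that the target region is a binary patch mask, and that "where vision causally matters" is approximated by the drop in per-token log-probability when the image is replaced by a counterfactual one. The function names (`anchoring_loss`, `causal_weights`, `faithfulness_reward`) and the `bonus` parameter are hypothetical.

```python
import math


def anchoring_loss(focus_attention, region_mask, eps=1e-8):
    """Assumed anchoring objective: negative log of the share of the
    <Focus> token's attention that falls on the annotated image region."""
    total = sum(focus_attention)
    in_region = sum(a for a, m in zip(focus_attention, region_mask) if m)
    return -math.log(in_region / (total + eps) + eps)


def causal_weights(logp_original, logp_counterfactual):
    """Assumed causal signal: per-token drop in log-probability when the
    image is swapped for a counterfactual one, normalized to sum to 1."""
    drops = [max(0.0, o - c) for o, c in zip(logp_original, logp_counterfactual)]
    total = sum(drops)
    if total == 0.0:
        # The intervention changed nothing, so no token depends on the image.
        return [0.0] * len(drops)
    return [d / total for d in drops]


def faithfulness_reward(is_correct, visual_attention, logp_original,
                        logp_counterfactual, bonus=0.5):
    """Assumed reward: only answer-correct trajectories are rewarded, with a
    bonus when visual attention is high on the vision-dependent tokens."""
    if not is_correct:
        return 0.0
    weights = causal_weights(logp_original, logp_counterfactual)
    alignment = sum(w * a for w, a in zip(weights, visual_attention))
    return 1.0 + bonus * alignment


if __name__ == "__main__":
    # Toy values: 3 image patches, 3 reasoning tokens.
    print(anchoring_loss([0.1, 0.7, 0.2], [0, 1, 0]))
    print(faithfulness_reward(True, [0.1, 0.8, 0.6],
                              [-1.0, -0.5, -2.0], [-1.0, -2.5, -4.0]))
```

In this sketch the anchoring term shrinks as more of the <Focus> token's attention lands inside the region, and the reward bonus grows only when attention to the image is concentrated on tokens whose likelihood actually depends on the image. Both design choices are illustrative readings of the abstract rather than details confirmed by the paper.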

# Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Source: [https://arxiv.org/abs/2605.22072](https://arxiv.org/abs/2605.22072)
[View PDF](https://arxiv.org/pdf/2605.22072)

## Submission history

From: Changyuan Tian

**[v1]** Thu, 21 May 2026 07:10:18 UTC (7,377 KB)

Similar Articles

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

arXiv cs.CL

AtManRL is a method that uses differentiable attention manipulation and reinforcement learning to train LLMs to generate more faithful chain-of-thought reasoning by ensuring reasoning tokens causally influence final predictions. Experiments on GSM8K and MMLU with Llama-3.2-3B demonstrate the approach can identify influential reasoning tokens and improve reasoning transparency.

Reinforcing Multimodal Reasoning Against Visual Degradation

Hugging Face Daily Papers

This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.