Reinforcing Multimodal Reasoning Against Visual Degradation
Summary
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.
Source: https://huggingface.co/papers/2605.09262
Abstract
ROMA is an RL fine-tuning framework that enhances multimodal large language models’ robustness against visual degradations while maintaining performance on clean inputs through a dual-forward-pass strategy and specialized regularization techniques.
Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
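To make the moving parts of the abstract concrete, here is a minimal sketch of how the four ingredients could combine into one per-trajectory loss value. It is not the authors' implementation: the abstract gives no formulas, so the k3-style KL estimator, the choice to apply the auxiliary policy gradient term to the worst-case view, the REINFORCE-style surrogate, and the names `surrogate_kl`, `roma_style_loss`, `beta`, and `lam` are all assumptions made for illustration. The sketch works on plain lists of per-token log-probabilities and only computes a scalar; a real trainer would use an autodiff framework.

```python
import math


def surrogate_kl(logp_clean, logp_aug):
    """Per-token KL estimate between clean-view and degraded-view log-probs.

    Assumption: uses the k3-style estimator exp(d) - 1 - d with
    d = logp_aug - logp_clean, evaluated on tokens sampled under the clean view.
    """
    out = []
    for lc, la in zip(logp_clean, logp_aug):
        d = la - lc
        out.append(math.exp(d) - 1.0 - d)  # non-negative, zero when views agree
    return out


def roma_style_loss(logp_clean, logp_aug_views, advantage, is_correct,
                    beta=0.1, lam=1.0):
    """Hypothetical per-trajectory loss combining the ideas in the abstract.

    logp_clean:     log-probs of the sampled tokens given the clean image
    logp_aug_views: one list per degraded view, teacher-forced on the SAME
                    tokens (second forward pass, no new rollout)
    advantage:      scalar advantage computed from the clean-image reward
    is_correct:     whether the clean-image trajectory earned the reward
    """
    n = len(logp_clean)

    # Standard policy-gradient surrogate on the clean view.
    pg_clean = -advantage * sum(logp_clean) / n

    # Pick the worst-case augmentation: the view with the largest mean KL.
    mean_kls = [sum(surrogate_kl(logp_clean, v)) / n for v in logp_aug_views]
    worst = max(range(len(mean_kls)), key=mean_kls.__getitem__)

    # Auxiliary policy-gradient term on the degraded view, anchored to the
    # clean-image advantage (assumption: applied to the worst-case view).
    pg_aug = -advantage * sum(logp_aug_views[worst]) / n

    # Correctness-conditioned regularization: enforce consistency only on
    # trajectories that were correct on the clean image.
    kl_term = mean_kls[worst] if is_correct else 0.0

    loss = pg_clean + lam * pg_aug + beta * kl_term
    return {"loss": loss, "worst_view": worst, "kl": kl_term}


if __name__ == "__main__":
    clean = [-0.1, -0.2]
    views = [[-0.1, -0.2], [-0.5, -0.9]]  # second view drifts from the clean one
    print(roma_style_loss(clean, views, advantage=1.0, is_correct=True))
    print(roma_style_loss(clean, views, advantage=-1.0, is_correct=False))
```

In this sketch the degraded views never generate their own trajectories, which is the point of the dual forward pass: rewards and advantages come only from clean-image rollouts, so a degraded input cannot inject a hallucinated trajectory into the reward signal.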
Similar Articles
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Robust-U1 is a framework that enables multimodal large language models (MLLMs) to self-recover corrupted visual content using supervised fine-tuning, reinforcement learning with dual rewards, and joint multimodal reasoning, achieving state-of-the-art robustness on corruption benchmarks.
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
Improving Multimodal Reasoning via Worst Dimension Optimization
This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces the worst dimension's robustness in multimodal reasoning to prevent failures like visual hallucinations from being masked by strong text logic.
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
ART (Art-based Reinforcement Training) enables parameter-efficient fine-tuning of frozen multimodal LLMs by optimizing raw visual input via gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs for high-throughput engines like vLLM.