Reinforcing Multimodal Reasoning Against Visual Degradation

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

multimodal-llm reinforcement-learning robustness visual-degradation fine-tuning roma

Summary

This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
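The regularization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the surrogate KL, so the sketch assumes the common non-negative estimator exp(r) - r - 1 on the log-ratio r, assumes the worst-case augmentation is chosen per trajectory as the view with the largest mean penalty, and uses hypothetical names (`surrogate_kl`, `robust_kl_penalty`). The inputs stand in for teacher-forced log-probabilities of the clean-image trajectory's tokens, scored once on the clean image and once on each corrupted view.

```python
import numpy as np

def surrogate_kl(logp_clean, logp_corrupt):
    """Per-token non-negative KL surrogate (assumed form: exp(r) - r - 1)."""
    log_ratio = logp_corrupt - logp_clean
    return np.exp(log_ratio) - log_ratio - 1.0

def robust_kl_penalty(logp_clean, logp_views, token_mask, is_correct):
    """Worst-case, correctness-gated KL penalty (illustrative sketch).

    logp_clean: (B, T) log-probs of the sampled tokens given the clean image.
    logp_views: (K, B, T) log-probs of the same tokens given K corrupted views,
                obtained by teacher forcing (no new rollouts).
    token_mask: (B, T) 1.0 for response tokens, 0.0 for padding.
    is_correct: (B,) 1.0 if the clean trajectory earned the reward, else 0.0.
    """
    kl = surrogate_kl(logp_clean[None], logp_views)            # (K, B, T)
    lengths = np.maximum(token_mask.sum(-1), 1.0)               # (B,)
    per_view = (kl * token_mask[None]).sum(-1) / lengths[None]  # (K, B)
    worst = per_view.max(axis=0)                                # worst view per trajectory
    gated = worst * is_correct                                  # skip failed trajectories
    return gated.sum() / max(is_correct.sum(), 1.0)

# Toy inputs: 2 trajectories, 3 tokens each, 2 corrupted views.
rng = np.random.default_rng(0)
logp_clean = np.log(rng.uniform(0.2, 0.9, size=(2, 3)))
logp_views = logp_clean[None] + rng.normal(0.0, 0.3, size=(2, 2, 3))
token_mask = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
is_correct = np.array([1.0, 0.0])
penalty = robust_kl_penalty(logp_clean, logp_views, token_mask, is_correct)
```

Gating on `is_correct` mirrors the stated motivation for correctness-conditioned regularization: consistency is only enforced where the clean trajectory was successful, so the model is not pushed to be invariant around wrong answers.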

Original Article

View Cached Full Text

Cached at: 05/12/26, 07:34 AM


Source: https://huggingface.co/papers/2605.09262

Abstract

ROMA is an RL fine-tuning framework that enhances multimodal large language models’ robustness against visual degradations while maintaining performance on clean inputs through a dual-forward-pass strategy and specialized regularization techniques.

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
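To make the "auxiliary policy gradient loss anchored to clean-image advantages" more concrete, here is a hedged sketch. It assumes the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) computed from clean-image rollouts, and assumes the auxiliary term is a plain advantage-weighted log-likelihood of the same trajectory under a corrupted view. The paper may use a clipped or otherwise different objective; the function names are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of clean-image rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def auxiliary_pg_loss(logp_corrupt, advantages, token_mask):
    """Advantage-weighted negative log-likelihood on a corrupted view (assumed form).

    logp_corrupt: (G, T) teacher-forced log-probs of the clean trajectories'
                  tokens when the model sees the corrupted image.
    advantages:   (G,) advantages computed from clean-image rewards only.
    token_mask:   (G, T) 1.0 for response tokens, 0.0 for padding.
    """
    lengths = np.maximum(token_mask.sum(-1), 1.0)
    per_seq_logp = (logp_corrupt * token_mask).sum(-1) / lengths
    return -(advantages * per_seq_logp).mean()

# Toy group of 4 rollouts with binary correctness rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
logp_corrupt = np.log(np.full((4, 3), 0.5))
token_mask = np.ones((4, 3))
loss = auxiliary_pg_loss(logp_corrupt, adv, token_mask)
```

Because the advantages come only from clean-image rewards, no reward is ever computed on a rollout generated from a degraded input, which is the situation the abstract associates with reward poisoning.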





Similar Articles

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Improving Multimodal Reasoning via Worst Dimension Optimization

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
