Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Hugging Face Daily Papers

Summary

Robust-U1 is a framework that enables multimodal large language models (MLLMs) to self-recover corrupted visual content using supervised fine-tuning, reinforcement learning with dual rewards, and joint multimodal reasoning, achieving state-of-the-art robustness on corruption benchmarks.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction; reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the recovered image with high visual quality; and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
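To make the dual-reward idea concrete, here is a minimal sketch of how a pixel-level SSIM term and a semantic-level embedding-similarity term could be combined into one scalar reward. It is not the authors' implementation. The function names (`global_ssim`, `cosine_similarity`, `dual_reward`), the equal weights, and the weighted-sum form are all assumptions, since the abstract does not state how the two rewards are combined. SSIM is computed over a single global window rather than averaged over local windows, images are flat sequences of intensities in [0, 1], and plain vectors stand in for CLIP image embeddings.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """SSIM over one global window (a simplification; standard SSIM averages local windows)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (stand-ins for CLIP embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dual_reward(
    recovered: Sequence[float],
    reference: Sequence[float],
    emb_recovered: Sequence[float],
    emb_reference: Sequence[float],
    w_pixel: float = 0.5,     # assumed weight, not from the paper
    w_semantic: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Weighted sum of a pixel-level and a semantic-level similarity term."""
    pixel_term = global_ssim(recovered, reference)
    semantic_term = cosine_similarity(emb_recovered, emb_reference)
    return w_pixel * pixel_term + w_semantic * semantic_term
```

In this sketch a perfect recovery scores close to 1, while a recovery that scrambles pixel structure or drifts semantically scores lower. In practice the embeddings would come from a CLIP image encoder and SSIM from a windowed implementation.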

Cached at: 06/12/26, 06:51 AM

Source: https://huggingface.co/papers/2606.08063

Abstract

Robust-U1 enhances multimodal large language models’ robustness against visual corruptions through self-recovery capabilities that improve both visual quality and reasoning performance.

Get this paper in your agent:

hf papers read 2606.08063

Don’t have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 3

Jiaqi-hkust/Robust-U1-SFT • 15B • Updated about 3 hours ago • 1

Jiaqi-hkust/Robust-U1-RL • 15B • Updated about 3 hours ago • 1

Jiaqi-hkust/Robust-U1 • 15B • Updated about 3 hours ago • 6 • 1

Similar Articles

Reinforcing Multimodal Reasoning Against Visual Degradation

Hugging Face Daily Papers

This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Hugging Face Daily Papers

The paper introduces BalCapRL, a balanced reinforcement learning framework for multimodal large language models that jointly optimizes correctness, coverage, and linguistic quality in image captioning. It demonstrates improved performance over existing methods by addressing trade-offs between utility and fluency through reward decoupling and length-conditional masking.