Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

June 6, 2026
Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to steer recovery toward high visual quality, and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
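
The dual reward mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two terms are computed or combined, so everything below is an assumption for illustration only: the function names (`global_ssim`, `cosine_similarity`, `dual_reward`) are hypothetical, the equal weights are arbitrary, SSIM is simplified to a single global window rather than the usual sliding-window average, and the CLIP embeddings are taken as precomputed vectors instead of being produced by an actual CLIP encoder.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """SSIM computed over one global window of two flattened grayscale images.

    Simplification (assumption): standard SSIM averages this statistic over
    local windows; here the whole image is treated as a single window.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dual_reward(
    recovered: Sequence[float],
    clean: Sequence[float],
    recovered_emb: Sequence[float],
    clean_emb: Sequence[float],
    w_pixel: float = 0.5,      # assumed weight, not from the paper
    w_semantic: float = 0.5,   # assumed weight, not from the paper
) -> float:
    """Weighted sum of a pixel-level term (SSIM) and a semantic-level term
    (cosine similarity of image embeddings, standing in for CLIP similarity)."""
    pixel_term = global_ssim(recovered, clean)
    semantic_term = cosine_similarity(recovered_emb, clean_emb)
    return w_pixel * pixel_term + w_semantic * semantic_term
```

Under these assumptions, a recovered image identical to the clean reference scores close to 1, while a recovery that differs in pixels or in embedding direction scores lower, which is the signal a reinforcement learning stage would try to maximize.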