Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content and Achieve Robust Understanding?

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to push recovered images toward high visual quality, and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
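
As a rough illustration of the dual-reward idea mentioned above, the following minimal sketch combines a pixel-level similarity term with a semantic-level one. It is not the authors' implementation. Several details are assumptions made for illustration: the function names `global_ssim`, `cosine_similarity`, and `dual_reward` are hypothetical, SSIM is computed over a single global window rather than the usual sliding windows, the semantic term is a plain cosine similarity over precomputed embedding vectors (standing in for CLIP image features), and the equal weights `w_pixel` and `w_semantic` are arbitrary.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image.

    Assumption: a simplification of the standard sliding-window SSIM,
    used here only to keep the sketch dependency-free.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def dual_reward(recovered, clean, emb_recovered, emb_clean,
                w_pixel=0.5, w_semantic=0.5):
    """Weighted sum of a pixel-level and a semantic-level similarity.

    Assumption: the embeddings are precomputed (e.g. by a CLIP image
    encoder); the weights are illustrative, not taken from the paper.
    """
    pixel_term = global_ssim(recovered, clean)
    semantic_term = cosine_similarity(emb_recovered, emb_clean)
    return w_pixel * pixel_term + w_semantic * semantic_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((32, 32))
    noisy = np.clip(clean + 0.3 * rng.standard_normal((32, 32)), 0.0, 1.0)
    emb = rng.standard_normal(16)  # placeholder for an image embedding
    print("perfect recovery:", dual_reward(clean, clean, emb, emb))
    print("noisy recovery:  ", dual_reward(noisy, clean, emb, emb))
```

In a reinforcement-learning setup, a scalar of this kind would score each recovered image against its clean reference; a recovery that matches the reference in both pixels and semantics scores close to 1, and degraded recoveries score lower.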