Robust-U1: Can Multimodal Large Language Models Self-Recover Corrupted Visual Content for Robust Understanding?

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the recovered images with high visual quality, and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
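To make the dual-reward idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the two reward terms are combined as a weighted sum, which the abstract does not state. The names `global_ssim`, `cosine_similarity`, `dual_reward`, `w_pixel`, and `w_semantic` are hypothetical. The SSIM here is a simplified single-window variant (standard SSIM averages over local windows), and the semantic term takes precomputed embedding vectors as a stand-in for CLIP image features.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over two flattened grayscale images of equal size.

    Simplification (assumption): standard SSIM averages this statistic over
    local windows; here the whole image is treated as one window.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and have the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Conventional SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dual_reward(
    recovered: Sequence[float],
    reference: Sequence[float],
    emb_recovered: Sequence[float],
    emb_reference: Sequence[float],
    w_pixel: float = 0.5,      # hypothetical weight, not taken from the paper
    w_semantic: float = 0.5,   # hypothetical weight, not taken from the paper
) -> float:
    """Weighted sum of a pixel-level term (SSIM) and a semantic-level term
    (embedding cosine similarity). The weighted-sum form is an assumption."""
    pixel_term = global_ssim(recovered, reference)
    semantic_term = cosine_similarity(emb_recovered, emb_reference)
    return w_pixel * pixel_term + w_semantic * semantic_term


if __name__ == "__main__":
    clean = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]
    washed_out = [0.5] * 6  # a recovery that lost all structure
    print(dual_reward(clean, clean, [1.0, 0.0], [1.0, 0.0]))
    print(dual_reward(washed_out, clean, [0.0, 1.0], [1.0, 0.0]))
```

In this sketch, a recovery that matches the clean reference in both pixels and embedding scores close to 1, while a recovery that loses structure or drifts semantically scores lower. In practice the embeddings would come from a CLIP image encoder and the SSIM from a windowed implementation.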