Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

June 6, 2026
Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the recovered images with high visual quality, and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
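
The dual reward named in the abstract pairs a pixel-level term (SSIM) with a semantic-level term (CLIP similarity). The abstract does not say how the two are combined, so the following is only a minimal sketch under stated assumptions: a weighted sum with hypothetical weights `w_pixel` and `w_semantic`, a simplified single-window SSIM rather than the usual sliding-window version, and image embeddings passed in as plain lists of floats (assumed here to come from a CLIP image encoder). None of the function names come from the Robust-U1 code base.

```python
import math


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two equal-length flat lists of pixel intensities.

    Simplification: standard SSIM averages this statistic over local windows;
    here the whole image is treated as one window.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants with the usual defaults K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_reward(recovered, reference, emb_recovered, emb_reference,
                w_pixel=0.5, w_semantic=0.5):
    """Hypothetical combination: weighted sum of pixel and semantic terms.

    The weights and the additive form are assumptions for illustration,
    not values reported in the abstract.
    """
    pixel_term = global_ssim(recovered, reference)
    semantic_term = cosine_similarity(emb_recovered, emb_reference)
    return w_pixel * pixel_term + w_semantic * semantic_term
```

Under these assumptions, a recovered image that matches the clean reference in both pixels and embedding scores close to 1, while structural distortion lowers the SSIM term even when the embedding similarity stays high. That gap is the reason for pairing the two signals.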