Robust-U1: Can Multimodal Large Language Models (MLLMs) Self-Recover Corrupted Visual Content for Robust Understanding?

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction; reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align the recovered images with high visual quality; and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
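To make the dual reward concrete, the following is a minimal sketch of how a pixel-level SSIM term and a semantic-level similarity term could be combined into one scalar reward. It is an illustration under stated assumptions, not the implementation used in Robust-U1: the abstract does not say how the two rewards are weighted, so the function names and the weights `w_pixel` and `w_semantic` are hypothetical. SSIM is computed here over a single global window on flattened grayscale pixels in [0, 1], and the CLIP term is represented as cosine similarity between embedding vectors that are assumed to have been precomputed by a CLIP image encoder.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dual_reward(
    recovered_pixels: Sequence[float],
    reference_pixels: Sequence[float],
    recovered_embedding: Sequence[float],
    reference_embedding: Sequence[float],
    w_pixel: float = 0.5,      # hypothetical weight, not taken from the paper
    w_semantic: float = 0.5,   # hypothetical weight, not taken from the paper
) -> float:
    """Weighted sum of a pixel-level (SSIM) and a semantic-level (cosine) reward."""
    pixel_term = global_ssim(recovered_pixels, reference_pixels)
    semantic_term = cosine_similarity(recovered_embedding, reference_embedding)
    return w_pixel * pixel_term + w_semantic * semantic_term


# Toy usage: a faithful recovery should score higher than a flat gray guess.
clean = [0.1, 0.5, 0.9, 0.3]
flat = [0.5, 0.5, 0.5, 0.5]
good = dual_reward(clean, clean, [1.0, 0.0], [1.0, 0.0])
poor = dual_reward(flat, clean, [0.0, 1.0], [1.0, 0.0])
print(good > poor)
```

In a real training setup one would more likely use a windowed SSIM (for example `skimage.metrics.structural_similarity`) and embeddings produced by an actual CLIP image encoder; the pure-Python version above only shows the shape of the reward and keeps the example free of external dependencies.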