Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
December 19, 2025
Authors: Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) struggle to maintain reliable performance under extreme real-world visual degradations, which limits their practical robustness. Existing robust MLLMs rely predominantly on implicit training or adaptation that targets only visual-encoder generalization, and thus suffer from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning to build degradation-aware reasoning foundations, (ii) reward-driven alignment for accurate perception of degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To support this approach, we introduce a specialized dataset of 11K samples featuring realistic degradations synthesized across four critical stages of the real-world visual processing pipeline; each sample is annotated with a structured chain connecting degradation parameters, perceptual influence, a pristine semantic reasoning chain, and a conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
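To make the annotation chain and the two training signals concrete, here is a minimal Python sketch of what one dataset sample, the intensity-adaptive reasoning budget, and the parameter-perception reward might look like. All field names, the stage labels, the linear depth rule, and the reward shape are illustrative assumptions based only on the abstract, not the paper's actual schema or implementation.

```python
from dataclasses import dataclass

# Illustrative sketch of the structured annotation chain described in the
# abstract (degradation parameters -> perceptual influence -> pristine
# semantic reasoning -> conclusion). Field names are assumptions.

@dataclass
class DegradationParams:
    stage: str        # which of the four real-world processing stages (assumed label)
    kind: str         # e.g. "gaussian_blur", "jpeg_compression" (assumed values)
    intensity: float  # normalized degradation strength in [0, 1]

@dataclass
class RobustR1Sample:
    image_path: str
    question: str
    degradation: DegradationParams
    perceptual_influence: str  # how the degradation alters visible content
    pristine_reasoning: str    # semantic reasoning chain as if the image were clean
    conclusion: str            # final answer grounded in the chain above

def reasoning_depth(intensity: float, min_steps: int = 2, max_steps: int = 8) -> int:
    """Toy version of 'dynamic reasoning depth scaling': heavier degradation
    buys a longer reasoning budget. The linear rule is an assumption."""
    return min_steps + round(intensity * (max_steps - min_steps))

def degradation_perception_reward(pred: float, true: float) -> float:
    """Toy reward for 'reward-driven alignment' on degradation parameters:
    full reward for an exact intensity estimate, decaying linearly with
    absolute error. The exact reward shape is an assumption."""
    return max(0.0, 1.0 - abs(pred - true))
```

In this picture, supervised fine-tuning would teach the model to emit the four-part chain, while the reward and the depth rule would govern the subsequent alignment stage; how Robust-R1 actually combines these signals is specified in the paper, not here.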