

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

April 6, 2026
作者: Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun
cs.AI

Abstract

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
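The Latent Representation Bridge described above can be illustrated with a toy sketch: rather than decoding generated latents to pixels and re-encoding them for reasoning, the generator's latents are projected directly into the understanding pathway. All function names, shapes, and transforms below are illustrative assumptions, not the paper's implementation.

```python
def generate_latent(degraded_image):
    """Stand-in generator: produce a 'restored' latent from a degraded input."""
    return [x * 2.0 for x in degraded_image]  # toy transform

def decode(latent):
    """Toy decoder: latent -> pixels (lossy: rounds to one decimal)."""
    return [round(x, 1) for x in latent]

def encode(pixels):
    """Toy encoder: pixels -> latent for the understanding branch."""
    return list(pixels)

def project(latent):
    """Toy learnable projection bridging generation and reasoning latents."""
    return [x + 0.0 for x in latent]

degraded = [0.123, 0.456]

# Standard decode-reencode detour: the pixel round-trip discards detail
# and blocks straightforward joint optimization of the two pathways.
detour_latent = encode(decode(generate_latent(degraded)))

# Latent bridge: skip pixels entirely; a projection maps generator latents
# straight into the reasoning pathway, keeping the path optimizable end to end.
bridged_latent = project(generate_latent(degraded))
```

In this sketch the detour yields `[0.2, 0.9]` while the bridge preserves `[0.246, 0.912]`, mirroring the abstract's claim that the direct connection avoids the lossy, hard-to-optimize round trip.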