CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Abstract

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
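The abstract says only that Interleaved GRPO optimizes under answer-correctness rewards; it does not give the reward definition or the advantage estimator. As a rough illustration of the general idea, the sketch below assumes the group-normalized advantage commonly used in GRPO (each rollout's reward minus the group mean, divided by the group standard deviation) and a binary exact-match reward. The function names, the string normalization, and the choice of population standard deviation are all assumptions made for illustration, not details taken from CLEAR.

```python
from statistics import mean, pstdev


def answer_reward(predicted: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 if the answer matches the reference.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumption; the paper's actual matching rule is not stated here).
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here; this is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one degraded-image question (made-up strings).
rollouts = ["A cat", "a dog", "a cat ", "a bird"]
rewards = [answer_reward(p, "a cat") for p in rollouts]
advantages = group_relative_advantages(rewards)
```

Under this scheme, rollouts that answer correctly receive positive advantages and incorrect ones negative, and a group in which every rollout earns the same reward contributes zero advantage. In a joint setup such as the one the abstract describes, the same scalar advantage would presumably weight both the text tokens and the intermediate visual generation of a rollout, though how CLEAR assigns credit across the two is not specified in this section.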
