CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Abstract

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models, which combine understanding and generation within a single architecture, are a natural fit for this challenge because their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish a generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; and (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, which covers three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision yields intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
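To make the "answer-correctness rewards" in step (3) concrete, the following is a minimal sketch of the group-relative advantage computation used in standard GRPO, not CLEAR's Interleaved GRPO itself. The abstract does not specify the reward definition or normalization, so the exact-match reward, the function names `correctness_reward` and `group_relative_advantages`, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the mean and std of its own group.

    This is the standard GRPO baseline: no learned value function is needed,
    because the group of sampled rollouts for one prompt serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question about a degraded image (illustrative data).
answers = ["cat", "dog", "Cat ", "bird"]
rewards = [correctness_reward(a, "cat") for a in answers]
advantages = group_relative_advantages(rewards)
```

Rollouts that answer correctly receive a positive advantage and the rest a negative one. In a jointly optimized setup such as the one the abstract describes, that same scalar would presumably weight both the text tokens and the intermediate visual generation of a rollout, though the abstract does not spell out how.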
