

Architecture Decoupling Is Not All You Need For Unified Multimodal Model

November 27, 2025
作者: Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
cs.AI

Abstract

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm, owing to the inherently conflicting objectives of the understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
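
The abstract does not spell out the exact form of the AIA loss. The following is a minimal PyTorch sketch of one plausible formulation, assuming the loss penalizes the divergence between a layer's cross-modal attention maps and a task-specific reference interaction pattern (for example, one distilled from a specialist model). The function name, tensor shapes, and the choice of KL divergence are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an attention-interaction alignment term.
# All names (attention_interaction_alignment_loss, target_pattern, attn_maps)
# are illustrative; the paper's exact formulation is not reproduced here.
import torch
import torch.nn.functional as F


def attention_interaction_alignment_loss(attn_maps, target_pattern, eps=1e-8):
    """Penalize divergence between cross-modal attention and a reference pattern.

    attn_maps:      (batch, heads, query_len, key_len) attention weights
                    restricted to cross-modal query/key positions.
    target_pattern: (query_len, key_len) task-specific reference pattern
                    (e.g., averaged attention from a decoupled expert model),
                    with each row summing to 1.
    """
    # Average over heads to obtain one interaction map per sample,
    # then renormalize rows into probability distributions.
    pred = attn_maps.mean(dim=1)                          # (batch, q, k)
    pred = pred / (pred.sum(dim=-1, keepdim=True) + eps)

    # Broadcast the reference pattern across the batch.
    target = target_pattern.unsqueeze(0).expand_as(pred)  # (batch, q, k)

    # KL(target || pred): pushes the model's cross-modal attention
    # toward the task-specific interaction pattern.
    return F.kl_div((pred + eps).log(), target, reduction="batchmean")
```

In such a setup, the alignment term would typically be added to the ordinary task objective with a small weight, e.g. `total_loss = task_loss + lambda_aia * aia_loss`, so that it shapes the cross-modal attention without dominating training.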