

Architecture Decoupling Is Not All You Need For Unified Multimodal Model

November 27, 2025
Authors: Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
cs.AI

Abstract

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm, because understanding and generation have inherently conflicting objectives. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., dual image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent this behavior becomes. Motivated by this observation, we propose an Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of the AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
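
The abstract does not spell out how the AIA loss is formulated. As a rough, non-authoritative sketch, the snippet below shows one plausible way an attention-interaction alignment term could be implemented in PyTorch: the model's cross-modal attention distributions are pulled toward task-specific reference interaction patterns (for instance, patterns observed in decoupled experts such as Qwen-VL for understanding or HunyuanImage for generation). The function name aia_loss, the KL-divergence formulation, and the reference patterns are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch only: this is NOT the paper's implementation of the AIA loss.
# Assumption: per-layer cross-modal attention maps can be extracted from the unified
# model, and task-specific reference interaction patterns are available to align to.
import torch


def aia_loss(attn_maps, reference_patterns, eps=1e-8):
    """KL divergence between the model's cross-modal attention distributions
    and task-specific reference interaction patterns.

    attn_maps:          list of tensors shaped [batch, heads, query_len, key_len]
    reference_patterns: list of tensors with the same shapes
    """
    loss = torch.zeros(())
    for attn, ref in zip(attn_maps, reference_patterns):
        # Renormalize over the key dimension so each row is a proper distribution.
        p = attn.clamp_min(eps)
        p = p / p.sum(dim=-1, keepdim=True)
        q = ref.clamp_min(eps)
        q = q / q.sum(dim=-1, keepdim=True)
        # KL(p || q), averaged over batch, heads, and query positions.
        loss = loss + (p * (p.log() - q.log())).sum(dim=-1).mean()
    return loss / max(len(attn_maps), 1)


if __name__ == "__main__":
    # Toy usage: two layers of random attention for a 4-token query / 8-token key block.
    torch.manual_seed(0)
    attn = [torch.rand(2, 8, 4, 8).softmax(dim=-1) for _ in range(2)]
    ref = [torch.rand(2, 8, 4, 8).softmax(dim=-1) for _ in range(2)]
    # In training, this term would be added to the task loss with a small weight.
    print(aia_loss(attn, ref).item())
```

In practice, how the reference patterns are obtained and which divergence is used would follow the paper's actual design; the sketch only illustrates the general idea of explicitly supervising cross-modal attention toward task-specific interaction patterns.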