Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
February 2, 2026
Authors: Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, Tianyi Zhou
cs.AI
Abstract
Recent advances in unified multimodal models (UMMs) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities and to measure the cognitive coherence of the two "unified" directions. Each question in the benchmark can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent performance gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence. To probe the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to reveal these fundamental limitations. Our findings indicate that knowledge within UMMs often remains disjoint: capability emergence and knowledge development across modalities are asynchronous, pointing the way for further exploration.
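To make the bidirectional evaluation idea concrete, the sketch below shows one way per-question results in the two directions (answering in text vs. answering in images) could be aggregated into per-direction accuracy, a gap score, and a cross-modal consistency rate. This is a minimal illustration under assumed definitions, not the paper's released scoring code; the names `QuestionResult` and `gap_and_consistency`, and the agreement-based consistency measure, are hypothetical.

```python
# Illustrative sketch (not GapEval's official implementation) of aggregating
# bidirectional per-question results into gap and consistency metrics.
# Field names and the consistency definition are assumptions for illustration.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionResult:
    text_correct: bool   # question answered correctly in the text (understanding) direction
    image_correct: bool  # question answered correctly in the image (generation) direction


def gap_and_consistency(results: List[QuestionResult]) -> Dict[str, float]:
    n = len(results)
    understanding_acc = sum(r.text_correct for r in results) / n
    generation_acc = sum(r.image_correct for r in results) / n
    # Fraction of questions where both directions agree (both correct or both wrong),
    # one simple proxy for cross-modal consistency.
    consistency = sum(r.text_correct == r.image_correct for r in results) / n
    return {
        "understanding_acc": understanding_acc,
        "generation_acc": generation_acc,
        "gap": abs(understanding_acc - generation_acc),
        "cross_modal_consistency": consistency,
    }


if __name__ == "__main__":
    demo = [
        QuestionResult(True, False),
        QuestionResult(True, True),
        QuestionResult(False, False),
        QuestionResult(True, False),
    ]
    print(gap_and_consistency(demo))
```

Under this reading, a large `gap` indicates one direction dominating the other, while low `cross_modal_consistency` indicates the two "unified" directions disagree even when overall accuracies are similar.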