Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

February 2, 2026
Authors: Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, Tianyi Zhou
cs.AI

Abstract

Recent advances in unified multimodal models (UMMs) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities and to measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than a deep cognitive convergence of the two capabilities. To probe the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate these limitations. Our findings indicate that knowledge within UMMs often remains disjoint, and that capability emergence and knowledge development across modalities are unsynchronized, pointing the way for further exploration.
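
The abstract does not specify GapEval's scoring formulas. Purely as an illustrative sketch of how a bidirectional understanding-generation gap and a cross-modal consistency score could be computed from per-item judgments, the following Python snippet is offered; all names (ItemResult, capability_gap, cross_modal_consistency) are hypothetical, and the per-item correctness labels are assumed to come from some external judging procedure, not from the paper's released code.

```python
# Illustrative sketch (not the paper's implementation) of quantifying a
# bidirectional gap between understanding and generation, assuming each
# benchmark item has already been judged correct/incorrect in both directions.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ItemResult:
    """Correctness of one benchmark question answered in both directions."""
    understanding_correct: bool  # textual answer about the item (understanding direction)
    generation_correct: bool     # generated image judged against the same item (generation direction)


def directional_accuracy(results: List[ItemResult]) -> Tuple[float, float]:
    """Accuracy of the understanding and generation directions, separately."""
    n = len(results)
    und = sum(r.understanding_correct for r in results) / n
    gen = sum(r.generation_correct for r in results) / n
    return und, gen


def capability_gap(results: List[ItemResult]) -> float:
    """Signed gap: positive means understanding outperforms generation."""
    und, gen = directional_accuracy(results)
    return und - gen


def cross_modal_consistency(results: List[ItemResult]) -> float:
    """Fraction of items where both directions agree (both correct or both wrong)."""
    agree = sum(r.understanding_correct == r.generation_correct for r in results)
    return agree / len(results)


if __name__ == "__main__":
    # Toy example with made-up judgments.
    toy = [
        ItemResult(True, False),
        ItemResult(True, True),
        ItemResult(False, False),
        ItemResult(True, False),
    ]
    und, gen = directional_accuracy(toy)
    print(f"understanding={und:.2f} generation={gen:.2f} "
          f"gap={capability_gap(toy):+.2f} consistency={cross_modal_consistency(toy):.2f}")
```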