UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

March 3, 2026
Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
cs.AI

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks do not systematically probe the specific tasks in which generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that organizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, spanning visual transformations that range from implicit to explicit. Extensive evaluation of over 30 models yields three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial-intelligence, visual-illusion, and multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures, and models sharing architectures, exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases across tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and new paradigms to fully unlock the potential of unified multimodal modeling.
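
To make the two inference modes concrete, below is a minimal Python sketch contrasting direct inference with Generate-then-Answer. The UnifiedModel interface, its method names, and the intermediate-image prompt are illustrative assumptions for exposition, not the paper's actual evaluation API.

from typing import Protocol, Sequence

class UnifiedModel(Protocol):
    # Hypothetical interface: we assume a unified model that exposes
    # both image generation and visual question answering.
    def generate_image(self, image: bytes, prompt: str) -> bytes: ...
    def answer(self, images: Sequence[bytes], prompt: str) -> str: ...

def direct_inference(model: UnifiedModel, image: bytes, question: str) -> str:
    # Direct inference: answer the question from the input image alone.
    return model.answer(images=[image], prompt=question)

def generate_then_answer(model: UnifiedModel, image: bytes, question: str) -> str:
    # Generate-then-Answer (GtA): first render an intermediate image that
    # makes the implied visual transformation explicit (e.g., a rotated
    # or completed view), then answer conditioned on both images.
    intermediate = model.generate_image(
        image, prompt=f"Render the visual state needed to answer: {question}"
    )
    return model.answer(images=[image, intermediate], prompt=question)

In this framing, finding 1 corresponds to generate_then_answer scoring below direct_inference on most subtasks, while finding 2 marks the spatial, illusion, and multi-round subtasks where the intermediate image helps.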