UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

March 3, 2026
作者: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
cs.AI

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks in which generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that organizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference; 2) consistent gains emerge in spatial-intelligence, visual-illusion, and multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial; 3) tasks with similar reasoning structures, and models sharing architectures, exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases across tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
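The abstract contrasts two inference modes for unified models: direct inference and Generate-then-Answer (GtA). The sketch below illustrates that distinction only; the `UnifiedModel` class and its `answer`/`generate_image` methods are hypothetical placeholders, not the paper's actual API or evaluation harness.

```python
# Minimal sketch of direct inference vs. Generate-then-Answer (GtA),
# as described in the abstract. All names here are illustrative stubs.

class UnifiedModel:
    """Stand-in for a unified multimodal model that can both
    understand (answer questions) and generate images."""

    def answer(self, image: str, question: str) -> str:
        # Direct inference: condition the answer on the given image.
        return f"answer({question!r} | {image})"

    def generate_image(self, image: str, question: str) -> str:
        # Produce an intermediate image relevant to the question,
        # e.g. an explicit visual transformation such as a rotation.
        return f"generated_view({image})"


def direct_inference(model: UnifiedModel, image: str, question: str) -> str:
    """Baseline: answer straight from the input image."""
    return model.answer(image, question)


def generate_then_answer(model: UnifiedModel, image: str, question: str) -> str:
    """GtA: first synthesize an intermediate image, then answer
    conditioned on that image instead of the original."""
    intermediate = model.generate_image(image, question)
    return model.answer(intermediate, question)


if __name__ == "__main__":
    model = UnifiedModel()
    img, q = "input.png", "Which shape is larger after rotation?"
    print(direct_inference(model, img, q))
    print(generate_then_answer(model, img, q))
```

The question GtA probes is whether conditioning the answer on a self-generated intermediate image helps; per the abstract, it usually degrades performance, except on subtasks where spatial perception or multi-step intermediate visual states matter.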