RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
September 29, 2025
Authors: Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu
cs.AI
Abstract
The integration of visual understanding and generation into unified
multimodal models represents a significant stride toward general-purpose AI.
However, a fundamental question remains unanswered by existing benchmarks: does
this architectural unification actually enable synergetic interaction between
the constituent capabilities? Existing evaluation paradigms, which primarily
assess understanding and generation in isolation, are insufficient for
determining whether a unified model can leverage its understanding to enhance
its generation, or use generative simulation to facilitate deeper
comprehension. To address this critical gap, we introduce RealUnify, a
benchmark specifically designed to evaluate bidirectional capability synergy.
RealUnify comprises 1,000 meticulously human-annotated instances spanning 10
categories and 32 subtasks. It is structured around two core axes: 1)
Understanding Enhances Generation, which requires reasoning (e.g., commonsense,
logic) to guide image generation, and 2) Generation Enhances Understanding,
which necessitates mental simulation or reconstruction (e.g., of transformed or
disordered visual inputs) to solve reasoning tasks. A key contribution is our
dual-evaluation protocol, which combines direct end-to-end assessment with a
diagnostic stepwise evaluation that decomposes tasks into distinct
understanding and generation phases. This protocol allows us to precisely
discern whether performance bottlenecks stem from deficiencies in core
abilities or from a failure to integrate them. Through large-scale evaluations
of 12 leading unified models and 6 specialized baselines, we find that current
unified models still struggle to achieve effective synergy, indicating that
architectural unification alone is insufficient. These results highlight the
need for new training strategies and inductive biases to fully unlock the
potential of unified modeling.
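The dual-evaluation protocol's diagnostic logic can be illustrated with a minimal sketch (not the authors' code; the function name, thresholds, and scoring interface are hypothetical). Given a model's end-to-end accuracy and its accuracies on the decomposed understanding and generation phases, the idea is to attribute failure either to a weak core ability or, when both phases score well but the end-to-end score lags far behind, to a failure of integration:

```python
# Minimal sketch of the dual-evaluation diagnostic. All names and thresholds
# are hypothetical illustrations, not RealUnify's actual implementation.

def diagnose(end_to_end: float, understanding: float, generation: float) -> str:
    """Classify the bottleneck for one model from three accuracies in [0, 1].

    - A weak decomposed phase implicates that core ability.
    - If both phases are strong but the end-to-end score falls well below
      their product (a naive independence estimate of pipelined success),
      the failure lies in integrating the abilities.
    """
    CORE_THRESHOLD = 0.5       # hypothetical cutoff for a "weak" core ability
    INTEGRATION_GAP = 0.15     # hypothetical gap signalling integration failure

    if understanding < CORE_THRESHOLD:
        return "bottleneck: understanding"
    if generation < CORE_THRESHOLD:
        return "bottleneck: generation"
    expected = understanding * generation  # naive independence estimate
    if expected - end_to_end > INTEGRATION_GAP:
        return "bottleneck: integration (synergy failure)"
    return "no clear bottleneck"

# Example: strong decomposed scores but weak end-to-end performance.
print(diagnose(end_to_end=0.40, understanding=0.85, generation=0.80))
# expected = 0.68; gap 0.28 > 0.15 → "bottleneck: integration (synergy failure)"
```

This matches the paper's qualitative finding: models can pass the decomposed phases yet fail end-to-end, indicating that the deficit lies in combining abilities rather than in the abilities themselves.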