RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
September 29, 2025
Authors: Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu
cs.AI
Abstract
The integration of visual understanding and generation into unified
multimodal models represents a significant stride toward general-purpose AI.
However, a fundamental question remains unanswered by existing benchmarks: does
this architectural unification actually enable synergetic interaction between
the constituent capabilities? Existing evaluation paradigms, which primarily
assess understanding and generation in isolation, are insufficient for
determining whether a unified model can leverage its understanding to enhance
its generation, or use generative simulation to facilitate deeper
comprehension. To address this critical gap, we introduce RealUnify, a
benchmark specifically designed to evaluate bidirectional capability synergy.
RealUnify comprises 1,000 meticulously human-annotated instances spanning 10
categories and 32 subtasks. It is structured around two core axes: 1)
Understanding Enhances Generation, which requires reasoning (e.g., commonsense,
logic) to guide image generation, and 2) Generation Enhances Understanding,
which necessitates mental simulation or reconstruction (e.g., of transformed or
disordered visual inputs) to solve reasoning tasks. A key contribution is our
dual-evaluation protocol, which combines direct end-to-end assessment with a
diagnostic stepwise evaluation that decomposes tasks into distinct
understanding and generation phases. This protocol allows us to precisely
discern whether performance bottlenecks stem from deficiencies in core
abilities or from a failure to integrate them. Through large-scale evaluations
of 12 leading unified models and 6 specialized baselines, we find that current
unified models still struggle to achieve effective synergy, indicating that
architectural unification alone is insufficient. These results highlight the
need for new training strategies and inductive biases to fully unlock the
potential of unified modeling.
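The dual-evaluation protocol's diagnostic logic can be illustrated with a minimal sketch (not the authors' code; the function name, thresholds, and scoring interface are hypothetical). Given a model's end-to-end accuracy and its accuracies on the decomposed understanding and generation phases, the idea is to attribute failure either to a weak core ability or, when both phases score well but the end-to-end score lags far behind, to a failure of integration:

```python
# Minimal sketch of the dual-evaluation diagnostic. All names and thresholds
# are hypothetical illustrations, not RealUnify's actual implementation.

def diagnose(end_to_end: float, understanding: float, generation: float) -> str:
    """Classify the bottleneck for one model from three accuracies in [0, 1].

    - A weak decomposed phase implicates that core ability.
    - If both phases are strong but the end-to-end score falls well below
      their product (a naive independence estimate of pipelined success),
      the failure lies in integrating the abilities.
    """
    CORE_THRESHOLD = 0.5       # hypothetical cutoff for a "weak" core ability
    INTEGRATION_GAP = 0.15     # hypothetical gap signalling integration failure

    if understanding < CORE_THRESHOLD:
        return "bottleneck: understanding"
    if generation < CORE_THRESHOLD:
        return "bottleneck: generation"
    expected = understanding * generation  # naive independence estimate
    if expected - end_to_end > INTEGRATION_GAP:
        return "bottleneck: integration (synergy failure)"
    return "no clear bottleneck"

# Example: strong decomposed scores but weak end-to-end performance.
print(diagnose(end_to_end=0.40, understanding=0.85, generation=0.80))
# expected = 0.68; gap 0.28 > 0.15 → "bottleneck: integration (synergy failure)"
```

This matches the paper's qualitative finding: models can pass the decomposed phases yet fail end-to-end, indicating that the deficit lies in combining abilities rather than in the abilities themselves.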