Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
November 25, 2025
Authors: Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan
cs.AI
Abstract
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox