Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
November 25, 2025
Authors: Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan
cs.AI
Abstract
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox