
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

January 27, 2026
Authors: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
cs.AI

Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Relying predominantly on verbal reasoning, current systems have achieved expert-level performance in formal and abstract domains such as mathematics and programming. However, they still lag far behind humans in domains such as physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation more naturally serves as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze the distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning and construct a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.
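
To make the notion of "interleaved visual-verbal CoT" concrete, the following is a minimal sketch of the control flow such reasoning implies, not the paper's actual code: a unified multimodal model alternately emits verbal steps and generated images, where the images act as explicit visual world-model states that later verbal steps can reason over. The function umm_generate_step and both step types are hypothetical placeholders for a real UMM inference call.

```python
# Minimal, hypothetical sketch of an interleaved visual-verbal CoT loop.
# Not the paper's implementation; `umm_generate_step` is a stub for a real
# unified multimodal model (UMM) call.

from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class VerbalStep:
    text: str  # a textual reasoning step


@dataclass
class VisualStep:
    image_bytes: bytes  # a generated image serving as an intermediate world state


Step = Union[VerbalStep, VisualStep]


def umm_generate_step(trace: List[Step], question: str) -> Optional[Step]:
    """Hypothetical UMM call: given the question and the reasoning trace so far,
    return the next verbal or visual step, or None when ready to answer."""
    raise NotImplementedError("stub: replace with an actual UMM inference call")


def interleaved_cot(question: str, max_steps: int = 8) -> List[Step]:
    """Collect an interleaved visual-verbal chain of thought for `question`."""
    trace: List[Step] = []
    for _ in range(max_steps):
        step = umm_generate_step(trace, question)
        if step is None:  # the model signals it can produce the final answer
            break
        # Visual steps update the explicit world model; verbal steps reason
        # over the most recent state in the trace.
        trace.append(step)
    return trace
```

A purely verbal CoT corresponds to the special case where every step in the trace is a VerbalStep; the paper's controlled comparison contrasts that setting with traces that also contain generated images.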