Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
January 27, 2026
Authors: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
cs.AI
Abstract
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Relying predominantly on verbal reasoning, current systems have achieved expert-level performance in formal and abstract domains such as mathematics and programming. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Altogether, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.
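
As a rough sketch of how internal world modeling can be cast as a component of CoT reasoning, one plausible factorization is shown below; the notation (a reasoning policy \pi, an internal world model W, and intermediate states s_t) is illustrative and not necessarily the paper's own formalization.

\[
P(y \mid x) \;=\; \sum_{a_{1:T},\, s_{1:T}}
P\!\left(y \mid x,\, a_{1:T},\, s_{1:T}\right)
\prod_{t=1}^{T}
\underbrace{\pi\!\left(a_t \mid x,\, a_{<t},\, s_{<t}\right)}_{\text{reasoning policy}}
\;
\underbrace{W\!\left(s_t \mid x,\, a_{\le t},\, s_{<t}\right)}_{\text{internal world model}}
\]

Under this reading, \pi proposes the next reasoning step a_t and W predicts the resulting intermediate state s_t; in interleaved visual-verbal CoT the states s_t may be either verbal statements or generated images, whereas purely verbal CoT constrains every s_t to text, which is where the hypothesized representational and prior-knowledge bottlenecks arise.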