Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
November 6, 2025
作者: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we develop the Video Thinking Benchmark (VideoThinkBench), which
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that video generation models
have the potential to serve as unified multimodal understanding and generation
models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.