Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
November 6, 2025
作者: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
"文本思考"与"图像思考"范式显著提升了大型语言模型(LLMs)和视觉语言模型(VLMs)的推理能力,但这些范式存在固有局限:(1)图像仅能捕捉瞬时状态,无法呈现动态过程或连续变化;(2)文本与视觉作为独立模态的割裂,阻碍了统一的多模态理解与生成。为突破这些限制,我们提出"视频思考"新范式,通过Sora-2等视频生成模型在统一时序框架中桥接视觉与文本推理。为支撑这一探索,我们构建了视频思考基准测试集VideoThinkBench,涵盖两类任务:(1)视觉中心任务(如视觉谜题);(2)文本中心任务(如GSM8K、MMMU子集)。评估表明Sora-2具备卓越推理能力:在视觉任务中与顶尖VLMs表现相当,且在视觉游戏等任务中实现反超;在文本任务中于MATH数据集达到92%准确率,MMMU数据集达到75.53%。我们系统分析了其能力来源,并发现自洽性与上下文学习能进一步提升Sora-2性能。研究表明,视频生成模型有望成为统一的多模态理解与生成载体,使"视频思考"成为统一的多模态推理范式。
English
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we develop the Video Thinking Benchmark (VideoThinkBench), which encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles) and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that video generation models have the potential to serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.