

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

November 6, 2025
作者: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
cs.AI

Abstract

"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions "thinking with video" as a unified multimodal reasoning paradigm.