Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and cannot represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models have the potential to serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
