Demystifying Video Reasoning

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, in which reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead emerges primarily along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. Within a single diffusion step, we further uncover self-evolved functional specialization inside Diffusion Transformers: early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof of concept, demonstrating how reasoning can be improved by ensembling latent trajectories from the same model run with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
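
The abstract does not say how the latent trajectories are ensembled, so the following is only a toy sketch of one possible reading, not the authors' method. It assumes that several trajectories are started from different seeds, denoised independently for a few early steps, averaged once, and then denoised as a single trajectory. Every name here (`toy_denoise_step`, `init_noise`, `mean_latent`, `ensemble_denoise`, `merge_step`) is hypothetical, and the "denoiser" is a stand-in that simply pulls a small vector toward a fixed target.

```python
import random
from typing import Callable, List

Latent = List[float]


def toy_denoise_step(x: Latent, step: int, total_steps: int) -> Latent:
    """Hypothetical stand-in for one denoising step of a real video model.

    It pulls the latent toward a fixed alternating target; the final step
    lands on the target because the rate reaches 1.0.
    """
    target = [1.0 if i % 2 == 0 else -1.0 for i in range(len(x))]
    rate = 1.0 / (total_steps - step)
    return [xi + rate * (ti - xi) for xi, ti in zip(x, target)]


def init_noise(seed: int, dim: int) -> Latent:
    """Draw the initial Gaussian noise latent for one seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mean_latent(latents: List[Latent]) -> Latent:
    """Element-wise mean across latents (the assumed ensembling rule)."""
    n = len(latents)
    return [sum(vals) / n for vals in zip(*latents)]


def ensemble_denoise(
    seeds: List[int],
    dim: int,
    total_steps: int,
    merge_step: int,
    step_fn: Callable[[Latent, int, int], Latent] = toy_denoise_step,
) -> Latent:
    """Denoise one trajectory per seed, then merge them at `merge_step`.

    Assumption: trajectories are averaged once, after an early step, and
    the merged latent is denoised alone for the remaining steps.
    """
    latents = [init_noise(seed, dim) for seed in seeds]
    for step in range(total_steps):
        latents = [step_fn(x, step, total_steps) for x in latents]
        if step == merge_step and len(latents) > 1:
            latents = [mean_latent(latents)]
    # If no merge happened (merge_step out of range), average at the end.
    return mean_latent(latents)


if __name__ == "__main__":
    final = ensemble_denoise(seeds=[0, 1, 2], dim=4, total_steps=10, merge_step=3)
    print(final)
```

In a real pipeline, `step_fn` would wrap the model's actual denoising update. Which steps to merge at, and whether a plain mean is the right way to combine trajectories, are open choices that this sketch does not settle.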
