Demystifying Video Reasoning
March 17, 2026
Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
cs.AI
Abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
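The training-free seed-ensembling strategy mentioned above can be sketched with a toy model. Everything below is an illustrative assumption, not the authors' implementation: `denoise_step` is a stand-in for one diffusion denoising step (a pull toward a fixed "answer" latent plus seed-dependent noise), and the merge rule simply averages the latent trajectories of identically configured runs after every shared step.

```python
import random
import math

DIM, STEPS = 8, 20
TARGET = [1.0] * DIM  # hypothetical "final answer" latent


def denoise_step(x, rng):
    # Hypothetical stand-in for one denoising step: move the latent
    # halfway toward the target and add seed-dependent noise.
    return [xi + 0.5 * (ti - xi) + 0.05 * rng.gauss(0.0, 1.0)
            for xi, ti in zip(x, TARGET)]


def run_trajectory(seed):
    # A single run: seed-specific initial latent, then STEPS denoising steps.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for _ in range(STEPS):
        x = denoise_step(x, rng)
    return x


def ensemble_trajectories(seeds):
    # Sketch of the training-free strategy: run the same model under
    # different random seeds and merge their latent trajectories by
    # averaging after every shared denoising step.
    rngs = [random.Random(s) for s in seeds]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for rng in rngs]
    for _ in range(STEPS):
        xs = [denoise_step(x, rng) for x, rng in zip(xs, rngs)]
        mean = [sum(col) / len(xs) for col in zip(*xs)]
        xs = [list(mean) for _ in xs]  # merge: all runs share the average
    return xs[0]


def error(x):
    # Distance from the hypothetical answer latent.
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET)))


single = run_trajectory(seed=0)
merged = ensemble_trajectories(seeds=[0, 1, 2, 3])
```

In this toy setting, averaging across seeds cancels part of the per-seed noise, so the merged trajectory tends to land closer to the target latent than a single run; the paper's claim is that an analogous merge over DiT latent trajectories improves reasoning accuracy.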