

Demystifying Video Reasoning

March 17, 2026
Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
cs.AI

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
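The training-free strategy the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step` is a hypothetical stand-in for one Diffusion Transformer denoising update, and averaging the per-seed latents at every step is one plausible way to "ensemble latent trajectories from identical models with different random seeds."

```python
import numpy as np

def denoise_step(latent, step, seed):
    # Hypothetical placeholder for a single DiT denoising update.
    # A real model would predict and subtract noise conditioned on
    # the step; here we only mimic the shape of that computation.
    rng = np.random.default_rng(seed * 1000 + step)
    return 0.9 * latent + 0.1 * rng.standard_normal(latent.shape)

def ensemble_trajectories(init_latent, num_steps, seeds):
    """Run the same (frozen) model under several random seeds and
    merge the latent trajectories at each denoising step."""
    latents = [init_latent.copy() for _ in seeds]
    for step in range(num_steps):
        # Advance every seed's trajectory by one denoising step.
        latents = [denoise_step(z, step, s) for z, s in zip(latents, seeds)]
        # Ensemble: collapse the trajectories onto their seed-wise mean,
        # so exploration in early steps is pooled before convergence.
        mean = np.mean(latents, axis=0)
        latents = [mean.copy() for _ in seeds]
    return latents[0]
```

Averaging after every step is the simplest merge rule; the paper's proof-of-concept may combine trajectories differently (e.g. only at selected steps), which this sketch does not attempt to reproduce.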