Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Abstract

Video diffusion models exhibit emergent reasoning capabilities such as solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this by studying the internal planning dynamics of video models, using 2D maze solving as a controlled testbed. Our investigation yields two findings. The first is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. The second is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. Consequently, video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of these findings, we introduce Chaining with Early Planning (ChEaP), which spends compute only on seeds with promising early plans and chains the resulting generations together to tackle complex mazes. ChEaP improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.
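
The control flow implied by the abstract (screen seeds by their early plan, fully denoise only a promising one, then chain short segments) can be illustrated with a toy sketch. This is not the authors' implementation. The video model is replaced by a hypothetical stand-in, `toy_early_plan`, that emits a noisy grid walk. The validity check, the Manhattan-distance score, and all names and parameters (`solve_chained`, `seeds_per_round`, `max_rounds`) are assumptions made for illustration. Only the 12-step horizon comes from the abstract.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]
MAX_STEPS = 12  # per-generation horizon, from the failure threshold in the abstract
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def is_valid(plan: List[Cell], walls: Set[Cell], start: Cell, size: int) -> bool:
    """True if the plan starts at `start`, moves one cell per step,
    and never leaves the grid or enters a wall."""
    if not plan or plan[0] != start:
        return False
    if any(manhattan(a, b) != 1 for a, b in zip(plan, plan[1:])):
        return False
    return all(0 <= r < size and 0 <= c < size and (r, c) not in walls
               for r, c in plan)


def toy_early_plan(seed: int, start: Cell, goal: Cell) -> List[Cell]:
    """HYPOTHETICAL stand-in for the coarse trajectory visible after a few
    denoising steps: a noisy walk that usually heads toward the goal and
    knows nothing about walls or grid bounds."""
    rng = random.Random(seed)
    plan = [start]
    for _ in range(MAX_STEPS):
        r, c = plan[-1]
        if (r, c) == goal:
            break
        toward = [(dr, dc) for dr, dc in MOVES
                  if manhattan((r + dr, c + dc), goal) < manhattan((r, c), goal)]
        dr, dc = rng.choice(toward) if rng.random() < 0.7 else rng.choice(MOVES)
        plan.append((r + dr, c + dc))
    return plan


def solve_chained(start: Cell, goal: Cell, walls: Set[Cell], size: int,
                  seeds_per_round: int = 32, max_rounds: int = 10):
    """Chain short segments toward the goal.
    Returns (path, number of seeds that would receive full denoising)."""
    path, full_generations = [start], 0
    for round_idx in range(max_rounds):
        current = path[-1]
        if current == goal:
            break
        # Cheap screening: inspect every seed's early plan.
        seeds = range(round_idx * seeds_per_round, (round_idx + 1) * seeds_per_round)
        plans = [toy_early_plan(s, current, goal) for s in seeds]
        # Keep plans that are valid and end closer to the goal (assumed criterion).
        promising = [p for p in plans
                     if is_valid(p, walls, current, size)
                     and manhattan(p[-1], goal) < manhattan(current, goal)]
        if not promising:
            continue  # no seed earns full denoising this round
        best = min(promising, key=lambda p: manhattan(p[-1], goal))
        full_generations += 1   # only the selected seed is denoised to completion
        path.extend(best[1:])   # the segment's end state starts the next generation
    return path, full_generations


if __name__ == "__main__":
    walls = {(1, 1), (2, 3), (4, 2)}
    path, spent = solve_chained((0, 0), (5, 5), walls, size=6)
    print("reached goal:", path[-1] == (5, 5))
    print("path length:", len(path) - 1, "| full generations:", spent)
```

In this sketch each round inspects `seeds_per_round` cheap early plans but pays for at most one full generation, which is the compute saving the abstract attributes to early plan commitment. The strict-progress filter is a simplification: it keeps the toy chain from stalling, but it would fail on mazes that require moving away from the goal.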