
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

January 28, 2026
Authors: Chengzu Li, Zanyi Wang, Jiaang Li, Yi Xu, Han Zhou, Huanyu Zhang, Ruichuan An, Dengyang Jiang, Zhaochong An, Ivan Vulić, Serge Belongie, Anna Korhonen
cs.AI

Abstract

Vision-Language Models excel at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context, such as agent icons and tangram shapes, as explicit control, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (the visual inference budget) yields better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.