Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

January 28, 2026
Authors: Chengzu Li, Zanyi Wang, Jiaang Li, Yi Xu, Han Zhou, Huanyu Zhang, Ruichuan An, Dengyang Jiang, Zhaochong An, Ivan Vulić, Serge Belongie, Anna Korhonen
cs.AI

Abstract

Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.