Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
November 19, 2025
Authors: Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu
cs.AI
Abstract
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Language modeling progressed from text generation to text-based reasoning, and the analogous progress of video models motivates us to ask: can video models reason via video generation? Compared with a discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which makes it an ideal substrate for spatial reasoning. In this work, we explore the reasoning-via-video paradigm and introduce VR-Bench, a comprehensive benchmark designed to systematically evaluate the reasoning capabilities of video models. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that supervised fine-tuning (SFT) can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading vision-language models (VLMs) and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
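To make the test-time scaling claim concrete, the sketch below shows one common way to quantify how drawing several diverse samples raises the chance that at least one generated solution is correct. The abstract does not say which metric the authors use; the combinatorial pass@k estimator here is an assumption chosen for illustration, and the function name `pass_at_k` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled rollouts is correct.

    n: total rollouts generated for one maze (hypothetical evaluation setup)
    c: number of those rollouts that solve the maze
    k: sampling budget at inference time, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect rollouts: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn rollouts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

Under this metric, a larger sampling budget k can only keep or raise the estimate for a fixed maze, which is the general shape of effect the abstract describes; the size of the gain depends on how diverse the samples actually are.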