VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

June 1, 2026
Authors: Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao
cs.AI

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to produce temporally coherent visual trajectories that solve reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, which leads to logical failures across diverse reasoning scenarios. Existing approaches use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions cannot capture intricate spatiotemporal details, and VGMs often fail to faithfully execute fine-grained or long-tail instructions even when given a valid plan. While VLMs struggle as solvers, they have strong perception capabilities for evaluating process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that recasts VLMs as "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, which guide a VGM reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings indicate that integrating VLMs as test-time teachers is a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/