Vision-Language Models Are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to generate temporally coherent visual trajectories that complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often fail to understand and follow task-specific rules, which leads to logical failures across diverse reasoning scenarios. Existing efforts use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions cannot fully capture intricate spatiotemporal details, and even with a valid plan, VGMs often cannot faithfully execute fine-grained or long-tail instructions. While VLMs are limited as solvers, they have strong perception capabilities for evaluating whether process constraints are satisfied and whether the final goal is achieved. Leveraging this strength, we propose a paradigm shift that recasts VLMs as "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, which guide a VGM Reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields an average gain of 16.7 points, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings suggest that integrating VLMs as test-time teachers is a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/