VLMs Are Excellent Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to generate temporally coherent visual trajectories that solve reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often fail to understand and follow task-specific rules, which leads to logical errors across diverse reasoning scenarios. Existing approaches use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions cannot capture intricate spatiotemporal details, and even with a valid plan VGMs often fail to faithfully execute fine-grained or long-tail instructions. Although VLMs are limited as solvers, they have strong perception capabilities and can evaluate both process-constraint satisfaction and final-goal achievement. Leveraging this strength, we propose a paradigm shift that recasts the VLM as a "teacher". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, which guide a VGM reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on a symbolic video reasoning benchmark (VBVR-Bench) and a general-purpose one (RULER-Bench) show that the proposed method yields an average performance gain of 16.7 points, far exceeding the gains of the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) at comparable test-time cost. These findings suggest that integrating VLMs as test-time teachers is a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/