VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

June 1, 2026
Authors: Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao
cs.AI

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to generate temporally coherent visual trajectories that complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even when given a valid plan. While VLMs struggle as solvers, they have strong perception capabilities for evaluating process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that recasts VLMs as "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, which guide a VGM Reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
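
To make the core idea concrete, the following is a toy sketch of test-time optimization of a low-rank adapter against a differentiable reward. It is not the authors' implementation. A frozen 2x2 linear map stands in for the VGM, a rank-1 adapter stands in for the LoRA module, and a hand-written squared-error score stands in for the reward a VLM teacher would derive from task rules. All names (`generate`, `reward`, `optimize_at_test_time`) and hyperparameters are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def generate(W, A, B, x):
    """Toy 'generator': frozen base W plus a rank-1 update B * (A . x)."""
    base = matvec(W, x)
    s = sum(a * xi for a, xi in zip(A, x))
    return [yi + bi * s for yi, bi in zip(base, B)]


def reward(y, target):
    """Assumed stand-in reward: 0 when the rule is met, negative otherwise."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def optimize_at_test_time(W, x, target, steps=100, lr=0.05):
    """Gradient ascent on the reward; only the adapter (A, B) is updated."""
    A = [0.1] * len(x)  # small fixed init for reproducibility (an assumption)
    B = [0.0] * len(W)  # zero init, so the first output equals the base output
    history = []
    for _ in range(steps):
        y = generate(W, A, B, x)
        history.append(reward(y, target))
        g = [-2.0 * (yi - ti) for yi, ti in zip(y, target)]  # d reward / d y
        s = sum(a * xi for a, xi in zip(A, x))
        g_dot_b = sum(gi * bi for gi, bi in zip(g, B))
        grad_B = [gi * s for gi in g]        # d reward / d B
        grad_A = [g_dot_b * xi for xi in x]  # d reward / d A
        B = [bi + lr * gb for bi, gb in zip(B, grad_B)]
        A = [ai + lr * ga for ai, ga in zip(A, grad_A)]
    history.append(reward(generate(W, A, B, x), target))
    return A, B, history


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights, never modified
    A, B, history = optimize_at_test_time(W, [1.0, 2.0], [2.0, 1.0])
    print("reward before adaptation:", history[0])
    print("reward after adaptation:", history[-1])
```

The base weights `W` are only read, never written, so all adaptation lives in the small adapter. This mirrors, in a very simplified form, the contrast the abstract draws with Best-of-N scaling, which resamples from an unchanged model rather than updating it.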