VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
March 15, 2026
Authors: Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal
cs.AI
Abstract
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and spatio-temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
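To make the object-aware grounding reward concrete, the following is a minimal sketch of one plausible instantiation: a reward that averages, over the annotated regions, the bounding-box IoU for predictions whose object label matches the ground truth. The function names (`iou`, `grounding_reward`) and the label-matching scheme are illustrative assumptions, not the paper's exact formulation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gold):
    """Hypothetical object-aware grounding reward (not the paper's exact form).

    pred, gold: lists of (object_label, box) pairs. A gold region scores
    its IoU only when the model predicts a box under the same label
    (identity consistency); the reward is the mean over gold regions.
    """
    if not gold:
        return 0.0
    pred_by_label = {label: box for label, box in pred}
    total = 0.0
    for label, box in gold:
        if label in pred_by_label:  # identity must match to earn overlap credit
            total += iou(pred_by_label[label], box)
    return total / len(gold)
```

Under this sketch, a perfectly localized, correctly labeled box earns reward 1.0, a half-overlapping box earns its IoU, and a mislabeled box earns nothing, so the signal couples identity consistency with multi-region box overlap as the abstract describes.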