VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
March 15, 2026
作者: Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal
cs.AI
Abstract
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
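The abstract describes an object-aware grounding reward that combines object identity consistency with multi-region bounding-box overlap. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch under assumed conventions: boxes are `(x1, y1, x2, y2)` tuples, predictions and references are dicts mapping object IDs to lists of region boxes, overlap is scored as mean IoU, and an identity mismatch zeroes the reward. All function names and the gating rule are assumptions, not the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred, ref):
    """Hypothetical object-aware grounding reward.

    pred / ref: dict mapping object id -> list of per-region boxes.
    Identity consistency: the predicted object ids must match the
    reference ids, otherwise the reward is 0 (an assumed gating rule).
    Overlap term: mean IoU over corresponding regions of each object.
    """
    if set(pred) != set(ref):
        return 0.0  # identity inconsistency -> no grounding reward
    ious = []
    for obj_id, ref_boxes in ref.items():
        for pb, rb in zip(pred[obj_id], ref_boxes):
            ious.append(iou(pb, rb))
    return sum(ious) / len(ious) if ious else 0.0
```

A perfectly grounded prediction (same IDs, identical boxes) scores 1.0, a wrong object identity scores 0.0, and partially overlapping boxes score in between, giving the RL optimizer a dense signal for grounding quality.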