VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

Abstract

Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding by using visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
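The abstract does not give the form of the object-aware grounding reward, so the following is only an illustrative sketch of how "object identity consistency" and "multi-region bounding-box overlap" might be combined into one scalar. The function names (`iou`, `grounding_reward`), the dictionary layout keyed by object name, and the choice to average best-match IoU are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Dict, List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Dict[str, List[Box]],
                     gold: Dict[str, List[Box]]) -> float:
    """Hypothetical reward in [0, 1].

    Identity consistency: a predicted box only counts toward a reference
    box that carries the same object name.
    Multi-region overlap: every reference box contributes its best IoU,
    so missing one of several regions lowers the score.
    """
    total, count = 0.0, 0
    for name, gold_boxes in gold.items():
        pred_boxes = pred.get(name, [])
        for g in gold_boxes:
            total += max((iou(p, g) for p in pred_boxes), default=0.0)
            count += 1
    return total / count if count else 0.0
```

Under this sketch, a prediction that places a perfect box under the wrong object name earns nothing for that region, which is one simple way to penalize identity drift across frames.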
