Towards One-to-Many Temporal Grounding

Abstract

Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research has focused predominantly on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs are optimized for the one-to-one setting and struggle in this context, often yielding near-zero scores because they lack event cardinality perception, that is, awareness of how many segments match the query. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark and introduce Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset of 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions designed specifically for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show that our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
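
The abstract names C-Acc and EtF1 but does not define them. To make the one-to-many setting concrete, the following is a minimal sketch under stated assumptions, not the paper's metrics: `count_correct` checks only whether the number of predicted segments equals the number of ground-truth segments, and `segment_f1` is a generic segment-level F1 with greedy one-to-one matching at a hypothetical IoU threshold of 0.5. All function names, the matching rule, and the threshold are illustrative choices.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_correct(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed cardinality check: does the prediction contain the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment], iou_threshold: float = 0.5) -> float:
    """Generic segment-level F1 (an assumption, not the paper's EtF1).

    Predictions are matched to ground-truth segments greedily, highest IoU
    first, with each segment used at most once.
    """
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    matched = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matched += 1
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query whose event occurs three times in the video.
gt_segments = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
single_prediction = [(0.0, 10.0)]  # typical one-to-one style output
full_prediction = [(0.0, 10.0), (21.0, 30.0), (40.0, 50.0)]
```

Under these assumptions, a model that returns one perfectly localized segment for a three-segment query still fails the count check and is penalized on recall, which illustrates why a one-to-one model can score poorly in the one-to-many setting even when its single prediction is accurate.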