Towards One-to-Many Temporal Grounding
June 4, 2026
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
cs.AI
Abstract
Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research focuses predominantly on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art multimodal large language models (MLLMs), optimized for one-to-one settings, struggle in this setting and often yield near-zero scores because they lack event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset of 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions designed specifically for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show that our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
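The abstract does not define C-Acc or EtF1, so the following is only a rough sketch of how count-aware, multi-segment scoring might work, not the benchmark's actual metrics. It assumes segments are `(start, end)` pairs in seconds, treats a count as correct when the number of predicted segments equals the number of ground-truth segments, and computes a generic segment-level F1 by greedy one-to-one matching at an IoU threshold of 0.5. The function names, the threshold, and the matching rule are all hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Hypothetical count check: same number of predicted and true segments."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_thresh: float = 0.5) -> float:
    """Generic segment-level F1 (an assumption, not the paper's EtF1).

    Each predicted segment may match at most one ground-truth segment;
    pairs are matched greedily in order of decreasing IoU.
    """
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query whose event occurs three times in the video.
gt_segments = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
single_pred = [(0.0, 10.0)]  # typical one-to-one style output

print(count_match(single_pred, gt_segments))  # False
print(segment_f1(single_pred, gt_segments))   # approximately 0.5
```

Under these assumptions, a model that returns one perfectly localized segment for a three-segment query reaches full precision but only one-third recall, and it fails the count check. That is the kind of failure a cardinality-aware metric is meant to expose.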