Towards One-to-Many Temporal Grounding

Abstract

Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research focuses predominantly on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Existing state-of-the-art multimodal large language models (MLLMs) are optimized for the one-to-one setting and struggle here, often yielding near-zero scores because they lack event cardinality perception, that is, awareness of how many matching events a video contains. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset of 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions designed specifically for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show that our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
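
To make the one-to-many setting concrete, the following is a minimal Python sketch of how a set of predicted segments might be compared against several ground-truth segments. It is an illustration under stated assumptions, not the benchmark's implementation: the abstract does not define C-Acc or EtF1, so `count_match` and `segment_f1` (both hypothetical names) use a plain count comparison and a generic IoU-thresholded F1 with greedy one-to-one matching, and the 0.5 threshold is an arbitrary choice.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). This representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """True when the predicted number of segments equals the ground-truth number.

    Hypothetical stand-in for a per-sample count check; not the paper's C-Acc.
    """
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Generic segment-level F1 with greedy one-to-one matching.

    Hypothetical illustration only; the paper's EtF1 may be defined differently.
    """
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0

    unmatched = list(range(len(gt)))  # indices of ground-truth segments still free
    matched = 0
    for p in pred:
        # Pick the free ground-truth segment that overlaps this prediction most.
        best_idx, best_iou = -1, 0.0
        for g in unmatched:
            iou = temporal_iou(p, gt[g])
            if iou > best_iou:
                best_idx, best_iou = g, iou
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched += 1
            unmatched.remove(best_idx)

    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gt)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (20.0, 30.0)]
    ground_truth = [(0.0, 10.0), (21.0, 30.0), (50.0, 60.0)]
    print(count_match(predicted, ground_truth))
    print(segment_f1(predicted, ground_truth))
```

In this example the model finds two of three events: every predicted segment is correct, yet one event is missed. A single-segment metric cannot express that shortfall, which is why the one-to-many setting calls for count-aware and set-level measures.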