

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

December 16, 2025
Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
cs.AI

Abstract

This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, thinking-free reinforcement learning with verifiable rewards (RLVR) as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
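To make the "interleaved textual encoding for time representation" idea concrete, the sketch below shows one plausible way to interleave plain-text timestamps with per-frame placeholder tokens before the grounding query. The function name, the "<frame>" placeholder, and the timestamp format are illustrative assumptions; the abstract does not specify the paper's exact prompt layout.

```python
# Illustrative sketch only: the frame placeholder token and the "XX.Xs" time
# format are assumptions, not the paper's published prompt template.
def build_interleaved_prompt(frame_placeholders, timestamps, query):
    """Interleave per-frame timestamp text with frame tokens, then append the query."""
    parts = []
    for placeholder, t in zip(frame_placeholders, timestamps):
        parts.append(f"{t:.1f}s")   # plain-text time token preceding each frame
        parts.append(placeholder)   # e.g. the model's image/frame placeholder token
    parts.append(f"Question: When does the following event occur? {query}")
    return " ".join(parts)

# Usage: 8 uniformly sampled frames from a 16-second clip.
frames = ["<frame>"] * 8
times = [i * 2.0 for i in range(8)]
print(build_interleaved_prompt(frames, times, "the person opens the door"))
```

A verifiable reward for VTG is commonly instantiated as the temporal IoU between the predicted and annotated segments, which is directly checkable without a learned reward model. The sketch below follows that common choice; the output-parsing rules, reward shaping, and function names are assumptions rather than the paper's exact RLVR recipe.

```python
# Hedged sketch of a verifiable reward for temporal grounding, assuming an
# IoU-based reward; the paper's exact parsing and shaping are not given here.
import re

def parse_span(text):
    """Extract a (start, end) pair in seconds from output such as '12.5 to 18.0'."""
    nums = re.findall(r"\d+(?:\.\d+)?", text)
    if len(nums) < 2:
        return None
    start, end = float(nums[0]), float(nums[1])
    return (start, end) if end > start else None

def temporal_iou(pred, gt):
    """Intersection-over-union of two [start, end] intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def vtg_reward(model_output, gt_span):
    """Reward = temporal IoU if a valid span is parsed, else 0 (format failure)."""
    pred = parse_span(model_output)
    return temporal_iou(pred, gt_span) if pred else 0.0

# Example: a prediction overlapping the ground-truth segment [10.0, 20.0] seconds.
print(vtg_reward("The event occurs from 12.5 to 18.0 seconds.", (10.0, 20.0)))  # 0.55
```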