
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

December 16, 2025
作者: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
cs.AI

Abstract

This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
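The abstract does not spell out what the "verifiable reward" looks like for temporal grounding, but a natural verifiable signal for this task is the temporal IoU between a predicted time span and the ground-truth span. The following is a minimal illustrative sketch (function names are hypothetical, not from the paper):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def vtg_reward(pred_span: tuple[float, float], gt_span: tuple[float, float]) -> float:
    """A verifiable reward for an RLVR rollout: here simply the temporal IoU,
    which is 1.0 for an exact match and 0.0 for disjoint spans."""
    return temporal_iou(pred_span, gt_span)
```

Because the reward is computed directly against annotated spans, it needs no learned reward model, which is what makes it "verifiable" in the RLVR sense.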