TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
November 7, 2025
Authors: Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
cs.AI
Abstract
Temporal search aims to identify a minimal set of relevant frames from tens of thousands of candidates based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating the search for video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods such as Group Relative Policy Optimization (GRPO) to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers the video frames searched during the interleaved reasoning process and uses the same policy model to verify the adequacy of the searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training stages of GRPO-CSV, filtering out samples with weak temporal dependencies to increase task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench, with a 4.1% improvement over the base model Qwen2.5-VL and a 2.0% improvement over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
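The abstract describes GRPO-CSV only at a high level. Below is a minimal sketch of the reward shaping it implies, assuming that each rollout's reward combines final-answer correctness with a self-verification score the same policy model assigns to the frames it searched, and that advantages are normalized within a rollout group as in standard GRPO. The names `verify_completeness` and `csv_weight`, and the additive reward composition, are illustrative assumptions, not the paper's implementation.

```python
# Sketch of group-relative advantages with a completeness self-verification
# bonus, per the GRPO-CSV description in the abstract. Not the authors' code.
from dataclasses import dataclass
from typing import Callable, List, Sequence
import statistics

@dataclass
class Rollout:
    answer_correct: bool          # outcome reward for the final answer
    searched_frames: List[int]    # frame indices gathered during interleaved search

def grpo_csv_advantages(
    rollouts: Sequence[Rollout],
    verify_completeness: Callable[[List[int]], float],  # policy model's own sufficiency score in [0, 1]
    csv_weight: float = 0.5,      # assumed weighting of the self-verification term
) -> List[float]:
    """Combine answer correctness with the policy's self-verification of its
    searched frames, then normalize rewards within the group (GRPO-style)."""
    rewards = [
        float(r.answer_correct) + csv_weight * verify_completeness(r.searched_frames)
        for r in rollouts
    ]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    # Stub standing in for the policy model's self-check: more frame coverage
    # yields a higher completeness score.
    stub_verifier = lambda frames: min(1.0, len(frames) / 8)
    group = [
        Rollout(answer_correct=True, searched_frames=[3, 40, 41, 900]),
        Rollout(answer_correct=True, searched_frames=[3]),
        Rollout(answer_correct=False, searched_frames=[]),
    ]
    print(grpo_csv_advantages(group, stub_verifier))
```

The zero-variance fallback matters in practice: without it, a group whose rollouts all earn identical rewards would divide by zero instead of yielding zero advantage for every member.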