TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
September 22, 2025
Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
cs.AI
Abstract
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework
designed to improve the effectiveness of adapting multimodal large language
models (MLLMs) to video temporal grounding tasks. We reveal that existing
reinforcement learning methods, such as Group Relative Policy Optimization
(GRPO), rely on on-policy sampling for policy updates. However, in tasks with
large temporal search spaces, this strategy becomes both inefficient and
limited in performance, as it often fails to identify temporally accurate
solutions. To address this limitation, TempSamp-R1 leverages ground-truth
annotations as off-policy supervision to provide temporally precise guidance,
effectively compensating for the sparsity and misalignment in on-policy
solutions. To further stabilize training and reduce variance in reward-based
updates, TempSamp-R1 introduces a non-linear soft advantage computation method
that dynamically reshapes the reward feedback via an asymmetric transformation.
By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1
optimizes a single unified model to support both CoT and non-CoT inference
modes, enabling efficient handling of queries with varying reasoning
complexity. Experimental results demonstrate that TempSamp-R1 outperforms
GRPO-based baselines, establishing new state-of-the-art performance on
benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions
(R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover,
TempSamp-R1 shows robust few-shot generalization capabilities under limited
data. Code: https://github.com/HVision-NKU/TempSamp-R1