TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
September 22, 2025
Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
cs.AI
Abstract
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework
designed to improve the effectiveness of adapting multimodal large language
models (MLLMs) to video temporal grounding tasks. We reveal that existing
reinforcement learning methods, such as Group Relative Policy Optimization
(GRPO), rely on on-policy sampling for policy updates. However, in tasks with
large temporal search spaces, this strategy becomes both inefficient and
limited in performance, as it often fails to identify temporally accurate
solutions. To address this limitation, TempSamp-R1 leverages ground-truth
annotations as off-policy supervision to provide temporally precise guidance,
effectively compensating for the sparsity and misalignment in on-policy
solutions. To further stabilize training and reduce variance in reward-based
updates, TempSamp-R1 introduces a non-linear soft advantage computation method
that dynamically reshapes the reward feedback via an asymmetric transformation.
By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1
optimizes a single unified model to support both CoT and non-CoT inference
modes, enabling efficient handling of queries with varying reasoning
complexity. Experimental results demonstrate that TempSamp-R1 outperforms
GRPO-based baselines, establishing new state-of-the-art performance on
benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions
(R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover,
TempSamp-R1 shows robust few-shot generalization capabilities under limited
data. Code: https://github.com/HVision-NKU/TempSamp-R1
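The non-linear soft advantage computation described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the group composition, the `beta` scale, and the tanh-based asymmetric transform below are assumptions chosen only to make the idea concrete. The sketch pools the reward of the ground-truth (off-policy) solution with on-policy rollout rewards, computes group-relative advantages as in GRPO, and softly compresses positive advantages to damp the variance introduced by the high-reward off-policy sample.

```python
import numpy as np

def soft_advantages(on_policy_rewards, gt_reward, beta=1.0):
    """Illustrative sketch (not the authors' exact formulation) of mixing
    off-policy ground-truth supervision with on-policy samples and applying
    an asymmetric non-linear reshaping to the resulting advantages."""
    # Group = on-policy rollouts plus the ground-truth (off-policy) solution.
    rewards = np.asarray(list(on_policy_rewards) + [gt_reward], dtype=np.float64)

    # GRPO-style group-relative baseline: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Assumed asymmetric transform: softly saturate positive advantages with
    # tanh while leaving negative advantages linear, reducing variance from
    # the (typically highest-reward) off-policy sample.
    return np.where(adv > 0, np.tanh(beta * adv), adv)

# Hypothetical usage: three on-policy rollout rewards (e.g., temporal IoU with
# the query interval) plus the ground-truth reward of 1.0.
print(soft_advantages([0.2, 0.55, 0.1], gt_reward=1.0))
```

The returned values would replace the raw standardized advantages in a GRPO-style policy update; the exact reshaping function used by TempSamp-R1 is described in the paper and code repository.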