TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Abstract

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
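
As a rough illustration of the quantities the abstract refers to, the sketch below computes temporal IoU, the R1@m metric, and GRPO-style group-relative advantages for a group in which the ground-truth interval is appended as an extra off-policy sample. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of IoU as the reward, the population standard deviation, and the toy intervals are all hypothetical choices made for illustration. The non-linear soft advantage transform is not specified in the abstract and is not reproduced here.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """R1@threshold: share of queries whose top-1 prediction has IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group (GRPO-style normalization).

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example (hypothetical values): one query with ground truth 10s to 20s.
gt = (10.0, 20.0)
on_policy = [(0.0, 5.0), (12.0, 30.0), (15.0, 18.0)]  # samples from the policy

# Mixed group: on-policy samples plus the ground truth as an off-policy sample.
group = on_policy + [gt]
rewards = [temporal_iou(s, gt) for s in group]  # assumption: reward = IoU
advantages = group_relative_advantages(rewards)
```

In this toy group the ground-truth interval receives the highest reward and therefore the largest advantage, which suggests how a temporally precise off-policy sample can supply a learning signal when the on-policy samples are all poorly aligned.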

