DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
June 9, 2025
Authors: Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
cs.AI
Abstract
Recent works have demonstrated the effectiveness of reinforcement learning
(RL)-based post-training in enhancing the reasoning capabilities of large
language models (LLMs). In particular, Group Relative Policy Optimization
(GRPO) has shown impressive success by employing a PPO-style reinforcement
algorithm with group-based normalized rewards. However, the application of GRPO
to Video Large Language Models (Video LLMs) has been less studied. In this
paper, we explore GRPO for video LLMs and identify two primary issues that
impede its effective learning: (1) reliance on safeguards, and (2) the
vanishing advantage problem. To mitigate these challenges, we propose
DeepVideo-R1, a video large language model trained with our proposed Reg-GRPO
(Regressive GRPO) and difficulty-aware data augmentation strategy. Reg-GRPO
reformulates the GRPO objective as a regression task, directly predicting the
advantage in GRPO. This design eliminates the need for safeguards like clipping
and min functions, thereby facilitating more direct policy guidance by aligning
the model with the advantage values. We also design a difficulty-aware data
augmentation strategy that dynamically augments training samples at solvable
difficulty levels, fostering diverse and informative reward signals. Our
comprehensive experiments show that DeepVideo-R1 significantly improves
performance across multiple video reasoning benchmarks.
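The group-based normalized reward in GRPO, and the vanishing-advantage problem the abstract identifies, can be illustrated with a minimal sketch. The function name and the use of population standard deviation are assumptions for illustration, not the paper's exact formulation:

```python
import statistics

def group_normalized_advantages(rewards):
    """Compute GRPO-style group-relative advantages for one prompt.

    Each sampled response's reward is normalized against the group:
    A_i = (r_i - mean(r)) / std(r).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; a modeling assumption here
    if std == 0:
        # When every response in the group earns the same reward (e.g. all
        # correct or all wrong), every advantage is zero and the policy
        # receives no learning signal -- the "vanishing advantage" problem
        # that the difficulty-aware augmentation strategy targets.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Reg-GRPO then treats these advantage values as regression targets for the policy directly, rather than feeding them through PPO-style clipping and min safeguards; the exact regression objective is given in the paper, not reproduced here.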