DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
June 9, 2025
Authors: Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
cs.AI
Abstract
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training in enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success by employing a PPO-style reinforcement algorithm with group-based normalized rewards. However, the application of GRPO to video large language models (Video LLMs) has been less studied. In this paper, we explore GRPO for Video LLMs and identify two primary issues that impede its effective learning: (1) reliance on safeguards, and (2) the vanishing advantage problem. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with our proposed Reg-GRPO (Regressive GRPO) and a difficulty-aware data augmentation strategy. Reg-GRPO reformulates the GRPO objective as a regression task that directly predicts the advantage in GRPO. This design eliminates the need for safeguards such as clipping and min functions, enabling more direct policy guidance by aligning the model with the advantage values. We also design a difficulty-aware data augmentation strategy that dynamically augments training samples at solvable difficulty levels, fostering diverse and informative reward signals. Our comprehensive experiments show that DeepVideo-R1 significantly improves video reasoning performance across multiple benchmarks.
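The two ideas in the abstract can be sketched in a few lines. The sketch below is a hypothetical, simplified illustration, not the paper's implementation: `group_normalized_advantages` computes GRPO's group-relative advantages, and `reg_grpo_loss` shows one plausible regression-style reformulation (an L2 fit of the policy's log-probability ratio to the advantage, replacing the clipped min objective); the exact loss form in Reg-GRPO may differ. The example also exhibits the vanishing advantage problem the abstract mentions: when every response in a group earns the same reward, all normalized advantages collapse to zero.

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward by the mean and
    std of its group of G sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def reg_grpo_loss(log_prob_ratios, advantages):
    """Hypothetical regression surrogate: instead of the clipped
    GRPO objective min(r*A, clip(r, 1-eps, 1+eps)*A), fit the
    log-probability ratio directly to the advantage with an L2 loss,
    removing the clip/min safeguards."""
    n = len(advantages)
    return sum((lr - a) ** 2
               for lr, a in zip(log_prob_ratios, advantages)) / n

# A group of 4 sampled responses to one video question,
# with binary correctness rewards:
advs = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])

# Vanishing advantage: identical rewards within a group give
# zero advantage for every response, hence no learning signal.
flat = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
```

The difficulty-aware augmentation in the paper targets exactly the second case: by re-sampling training examples at a solvable difficulty, groups are more likely to contain a mix of correct and incorrect responses, keeping the advantages non-zero.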