DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
June 9, 2025
Authors: Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
cs.AI
Abstract
Recent works have demonstrated the effectiveness of reinforcement learning
(RL)-based post-training in enhancing the reasoning capabilities of large
language models (LLMs). In particular, Group Relative Policy Optimization
(GRPO) has shown impressive success by employing a PPO-style reinforcement
algorithm with group-based normalized rewards. However, the application of GRPO
to Video Large Language Models (Video LLMs) has been less studied. In this
paper, we explore GRPO for video LLMs and identify two primary issues that
impede its effective learning: (1) reliance on safeguards, and (2) the
vanishing advantage problem. To mitigate these challenges, we propose
DeepVideo-R1, a video large language model trained with our proposed Reg-GRPO
(Regressive GRPO) and difficulty-aware data augmentation strategy. Reg-GRPO
reformulates the GRPO objective as a regression task, directly predicting the
advantage in GRPO. This design eliminates the need for safeguards like clipping
and min functions, thereby facilitating more direct policy guidance by aligning
the model with the advantage values. We also design a difficulty-aware data
augmentation strategy that dynamically augments training samples at solvable
difficulty levels, fostering diverse and informative reward signals. Our
comprehensive experiments show that DeepVideo-R1 significantly improves
performance across multiple video reasoning benchmarks.
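The group-based normalized reward in GRPO, and the vanishing-advantage problem the abstract identifies, can be illustrated with a minimal sketch. The function name and the use of population standard deviation are assumptions for illustration, not the paper's exact formulation:

```python
import statistics

def group_normalized_advantages(rewards):
    """Compute GRPO-style group-relative advantages for one prompt.

    Each sampled response's reward is normalized against the group:
    A_i = (r_i - mean(r)) / std(r).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; a modeling assumption here
    if std == 0:
        # When every response in the group earns the same reward (e.g. all
        # correct or all wrong), every advantage is zero and the policy
        # receives no learning signal -- the "vanishing advantage" problem
        # that the difficulty-aware augmentation strategy targets.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Reg-GRPO then treats these advantage values as regression targets for the policy directly, rather than feeding them through PPO-style clipping and min safeguards; the exact regression objective is given in the paper, not reproduced here.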