Data-Efficient RLVR via Off-Policy Influence Guidance
October 30, 2025
Authors: Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang
cs.AI
Abstract
Data selection is a critical aspect of Reinforcement Learning with Verifiable
Rewards (RLVR) for enhancing the reasoning capabilities of large language
models (LLMs). Current data selection methods are largely heuristic-based,
lacking theoretical guarantees and generalizability. This work proposes a
theoretically grounded approach using influence functions to estimate the
contribution of each data point to the learning objective. To overcome the
prohibitive computational cost of policy rollouts required for online influence
estimation, we introduce an off-policy influence estimation method that
efficiently approximates data influence using pre-collected offline
trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we
employ sparse random projection to reduce dimensionality and improve storage
and computation efficiency. Leveraging these techniques, we develop
Curriculum RL with Off-Policy
Influence guidance (CROPI), a multi-stage RL framework that
iteratively selects the most influential data for the current policy.
Experiments on models up to 7B parameters demonstrate that CROPI significantly
accelerates training. On a 1.5B model, it achieves a 2.66x step-level
acceleration while using only 10% of the data per stage compared to
full-dataset training. Our results highlight the substantial potential of
influence-based data selection for efficient RLVR.
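To make the two core ideas concrete, the sketch below is a minimal, hypothetical illustration (not the paper's released code or exact formulation): it approximates each candidate example's influence on the learning objective as a first-order inner product between its pre-collected off-policy gradient and the gradient of the target objective, after compressing the high-dimensional gradients with a sparse random projection. All function names, dimensions, and the top-10% selection threshold are illustrative assumptions.

```python
import numpy as np

def sparse_random_projection(dim_in, dim_out, density=0.01, seed=0):
    """Build a sparse random sign projection matrix, scaled so that
    squared norms are preserved in expectation."""
    rng = np.random.default_rng(seed)
    mask = rng.random((dim_in, dim_out)) < density
    signs = rng.choice([-1.0, 1.0], size=(dim_in, dim_out))
    return (mask * signs) / np.sqrt(density * dim_out)

def project(grad, proj):
    """Compress a flattened gradient vector into the low-dimensional space."""
    return grad @ proj

def influence_scores(candidate_grads, target_grad, proj):
    """Approximate each candidate example's influence on the learning
    objective as the inner product between its projected (off-policy)
    gradient and the projected target-objective gradient.
    Higher score = expected to be more beneficial to train on."""
    g_target = project(target_grad, proj)
    return np.array([project(g, proj) @ g_target for g in candidate_grads])

# Toy usage: score a pool of candidates and keep the top 10% for the next stage.
d, k, n = 10_000, 256, 500                           # gradient dim, projected dim, pool size
proj = sparse_random_projection(d, k)
pool_grads = [np.random.randn(d) for _ in range(n)]  # stand-ins for per-example off-policy gradients
target = np.random.randn(d)                          # stand-in for the current objective gradient
scores = influence_scores(pool_grads, target, proj)
selected = np.argsort(scores)[-n // 10:]             # indices of the top 10% of the pool
```

In a multi-stage curriculum such as CROPI, a scoring-and-selection step of this kind would be repeated at the start of each stage against the current policy, so the selected subset tracks what is most influential as training progresses.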