

Data-Efficient RLVR via Off-Policy Influence Guidance

October 30, 2025
作者: Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang
cs.AI

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
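Two of the techniques named in the abstract are easy to make concrete. The sketch below is a minimal illustration, assuming a TracIn-style first-order influence score (the dot product between a candidate's policy-gradient estimate and the gradient of the learning objective) and an Achlioptas-style sparse random projection; the function names, dimensions, and density parameter `s` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def achlioptas_projection(d, k, s=3, seed=0):
    """Build a d x k sparse random projection (Achlioptas-style).

    Entries are +sqrt(s/k), 0, -sqrt(s/k) with probabilities
    1/(2s), 1 - 1/s, 1/(2s), so inner products are preserved in
    expectation and influence scores survive the compression.
    """
    rng = np.random.default_rng(seed)
    u = rng.random((d, k))
    proj = np.zeros((d, k))
    proj[u < 1.0 / (2 * s)] = np.sqrt(s / k)
    proj[u > 1.0 - 1.0 / (2 * s)] = -np.sqrt(s / k)
    return proj

def influence_scores(per_example_grads, target_grad, proj):
    """First-order influence of each candidate on the objective.

    per_example_grads: (n, d) policy-gradient estimates, one row per
        candidate, computed from pre-collected (off-policy) trajectories.
    target_grad: (d,) gradient of the objective we want to improve.
    Both sides are projected to k dimensions before the dot product,
    so only the compressed gradients ever need to be stored.
    """
    z = per_example_grads @ proj      # (n, k) compressed gradients
    z_target = target_grad @ proj     # (k,)
    return z @ z_target               # (n,) influence estimates

# Toy usage: 8 candidates with 10_000-dim gradients compressed to 64 dims.
rng = np.random.default_rng(1)
grads = rng.standard_normal((8, 10_000))
g_target = rng.standard_normal(10_000)
proj = achlioptas_projection(10_000, 64)
scores = influence_scores(grads, g_target, proj)
print(np.argsort(scores)[::-1])       # candidates ranked by influence
```

The payoff of the projection is storage: ranking candidates needs only the k-dimensional compressed gradients, not the full LLM-sized gradient per example.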
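The multi-stage curriculum can likewise be sketched in a few lines: each stage re-scores the candidate pool against the current policy using the off-policy estimator, keeps the most influential fraction (10% per stage in the 1.5B experiment), and trains on that subset. In the skeleton below, `estimate_off_policy_influence` and `rl_train` are hypothetical stand-ins for the paper's estimator and RLVR trainer, not its API.

```python
import numpy as np

def estimate_off_policy_influence(policy, dataset, offline_trajectories):
    """Placeholder: one influence score per candidate, approximated from
    pre-collected trajectories instead of fresh policy rollouts."""
    return np.random.default_rng(0).standard_normal(len(dataset))

def rl_train(policy, subset):
    """Placeholder for one stage of RLVR training on the selected subset."""
    return policy

def cropi(policy, dataset, offline_trajectories, n_stages=4, keep_frac=0.10):
    """Curriculum RL with Off-Policy Influence guidance (sketch).

    Each stage scores the full pool against the *current* policy and
    trains only on the most influential fraction, so the curriculum
    adapts as the policy improves.
    """
    for stage in range(n_stages):
        scores = estimate_off_policy_influence(policy, dataset,
                                               offline_trajectories)
        k = max(1, int(keep_frac * len(dataset)))
        top = np.argsort(scores)[::-1][:k]   # most influential examples
        policy = rl_train(policy, [dataset[i] for i in top])
    return policy
```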