Data-Efficient RLVR via Off-Policy Influence Guidance

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic and lack theoretical guarantees and generalizability. This work proposes a theoretically grounded approach that uses influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of the policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration relative to full-dataset training while using only 10% of the data per stage. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
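The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, the random vectors stand in for per-sample policy gradients computed from offline trajectories, the projection is an Achlioptas-style sparse random matrix, and the influence score is assumed to be a simple inner product between projected gradients (a common first-order approximation, which may differ from the estimator actually used in CROPI).

```python
import numpy as np


def sparse_projection_matrix(d, k, s=3, seed=0):
    """Build a (d, k) sparse random projection matrix (Achlioptas-style).

    Entries are +1 or -1 with probability 1/(2s) each and 0 otherwise,
    scaled by sqrt(s / k) so inner products are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice(
        [-1.0, 0.0, 1.0],
        size=(d, k),
        p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
    )
    return signs * np.sqrt(s / k)


def select_top_influence(train_grads, target_grad, proj, fraction=0.1):
    """Score each training gradient against a target gradient and keep the top fraction.

    Assumption: influence is approximated by the inner product of the
    projected gradients; this is an illustrative choice, not the paper's formula.
    """
    z_train = train_grads @ proj    # (n, k) projected per-sample gradients
    z_target = target_grad @ proj   # (k,) projected target gradient
    scores = z_train @ z_target     # (n,) approximate influence scores
    n_keep = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[::-1][:n_keep]  # indices of highest scores first
    return top, scores


# Toy data: 20 synthetic "gradients" of dimension 2000; rows 3 and 11 are
# deliberately aligned with the target gradient.
rng = np.random.default_rng(1)
d, k, n = 2000, 256, 20
target_grad = rng.standard_normal(d)
train_grads = rng.standard_normal((n, d))
train_grads[[3, 11]] = target_grad + 0.1 * rng.standard_normal((2, d))

proj = sparse_projection_matrix(d, k)
top, scores = select_top_influence(train_grads, target_grad, proj, fraction=0.1)
print(sorted(top.tolist()))
```

In this toy setting, keeping 10% of 20 samples selects two indices. In a multi-stage curriculum, scoring of this kind would be repeated at each stage with gradients taken at the current policy.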
