Efficient RLVR Training via Weighted Mutual Information Data Selection

Abstract

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency depends critically on how training data are selected. Existing online selection strategies rely predominantly on difficulty-based heuristics that favor datapoints with intermediate success rates. This implicitly equates difficulty with informativeness and neglects the epistemic uncertainty that arises from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that the expected uncertainty reduction decomposes into complementary difficulty-dependent and evidence-dependent components, which reveals a fundamental limitation of difficulty-only selection. Building on this observation, InSight constructs a stable acquisition score from the mean belief about each datapoint's success rate rather than from noisy sampled outcomes, and it extends naturally to the multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to roughly 2.2x acceleration, with negligible additional computational overhead.
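The difference between difficulty and evidence can be illustrated with a small sketch. It is not InSight's acquisition score, whose weighting and exact form are not given in this abstract. It assumes a Beta(a, b) belief over a datapoint's latent success rate with integer pseudo-counts, and computes the plain, unweighted mutual information between one binary rollout outcome and that latent rate. The function names are hypothetical.

```python
import math


def harmonic(n: int) -> float:
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (H_0 = 0)."""
    return sum(1.0 / k for k in range(1, n + 1))


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli(p) outcome, for 0 < p < 1."""
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def expected_entropy(a: int, b: int) -> float:
    """E[H(p)] for p ~ Beta(a, b) with positive integer a, b.

    Uses E[p ln p] = a/(a+b) * (psi(a+1) - psi(a+b+1)) and the identity
    psi(n+1) - psi(m+1) = H_n - H_m, which holds for integers.
    """
    n = a + b
    h_n = harmonic(n)
    return -(a / n) * (harmonic(a) - h_n) - (b / n) * (harmonic(b) - h_n)


def mutual_information(a: int, b: int) -> float:
    """Unweighted I(outcome; latent rate) = H(E[p]) - E[H(p)], in nats.

    Assumption: this is an illustrative quantity, not the paper's
    weighted objective.
    """
    if a < 1 or b < 1:
        raise ValueError("a and b must be positive integers")
    return binary_entropy(a / (a + b)) - expected_entropy(a, b)


if __name__ == "__main__":
    # (a, b) = (successes + 1, failures + 1) under a uniform prior.
    for a, b in [(1, 1), (10, 10), (1, 9)]:
        mean = a / (a + b)
        print(f"Beta({a},{b}): mean={mean:.2f}, MI={mutual_information(a, b):.4f}")
```

Under these assumptions, Beta(1, 1) and Beta(10, 10) share the same mean success rate of 0.5, yet the well-observed datapoint carries far less expected information (about 0.024 nats versus about 0.193). A hard datapoint with less evidence, Beta(1, 9), scores higher than Beta(10, 10) at about 0.042 nats, which a difficulty-only heuristic would not capture.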
