ChatPaper.ai


Efficient RLVR Training via Weighted Mutual Information Data Selection

March 2, 2026
Authors: Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo
cs.AI

Abstract

Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting the epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that the expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to the multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
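To make the abstract's decomposition concrete, here is a minimal, illustrative sketch of a BALD-style information-gain score under a Beta-Bernoulli model of a datapoint's latent success rate. This is an assumption, not InSight's actual objective: the paper's weighting scheme and acquisition formula are not given in the abstract, so the code only demonstrates how an information-theoretic score depends on both difficulty (the mean success rate) and evidence (the number of rollouts observed).

```python
# Illustrative only: expected information gain about a datapoint's latent
# success rate p from one more rollout, with p ~ Beta(a, b) under a uniform
# prior. NOT the paper's weighted mutual information objective.
from math import log

def _harmonic(n: int) -> float:
    """H_n = sum_{k=1..n} 1/k; for integers, psi(n + 1) = -gamma + H_n."""
    return sum(1.0 / k for k in range(1, n + 1))

def _binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log(p) + (1.0 - p) * log(1.0 - p))

def info_gain(successes: int, failures: int) -> float:
    """I(y; p) = H_b(E[p]) - E_p[H_b(p)] for p ~ Beta(a, b),
    where a = successes + 1 and b = failures + 1 (uniform prior).
    The expectation uses the closed form
    E[-p log p] = (a / (a + b)) * (H_{a+b} - H_a) for integer a, b."""
    a, b = successes + 1, failures + 1
    mean = a / (a + b)
    expected_posterior_entropy = (
        mean * (_harmonic(a + b) - _harmonic(a))
        + (1.0 - mean) * (_harmonic(a + b) - _harmonic(b))
    )
    return _binary_entropy(mean) - expected_posterior_entropy
```

Under this toy score, a datapoint with a 50% empirical success rate over few rollouts outranks both a heavily sampled 50% datapoint and an equally under-sampled easy or hard one, mirroring the abstract's claim that informativeness depends on difficulty and evidence jointly, so a difficulty-only heuristic discards one of the two signals.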
PDF · March 4, 2026