Efficient RLVR Training via Weighted Mutual Information Data Selection
March 2, 2026
Authors: Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo
cs.AI
Abstract
Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favoring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting the epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that the expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of a datapoint's success rate rather than noisy sampled outcomes, and naturally extends to the multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, a +1.01 improvement on general reasoning, and up to ~2.2x training acceleration, with negligible additional computational overhead.
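The abstract does not give InSight's exact acquisition score, but the idea of separating a difficulty term from an evidence term under a Bayesian latent success rate can be sketched with a standard Beta-Bernoulli model. The function names and the specific score below are hypothetical illustrations, not the paper's formula: a datapoint's score combines the entropy of its mean success belief (peaking at intermediate difficulty) with an evidence factor that grows as the number of observed rollouts shrinks.

```python
import math


def beta_update(alpha: float, beta: float, successes: int, failures: int):
    """Conjugate Beta-Bernoulli update of the posterior over a datapoint's
    latent success rate after observing rollout outcomes."""
    return alpha + successes, beta + failures


def acquisition_score(alpha: float, beta: float) -> float:
    """Toy acquisition score (hypothetical stand-in for InSight's weighted
    mutual information objective): a difficulty term times an evidence term.
    """
    # Mean belief of the success rate -- a stable statistic, not a noisy sample.
    p = alpha / (alpha + beta)
    # Difficulty term: Bernoulli entropy of the mean belief, maximal at p = 0.5.
    h = 0.0 if p in (0.0, 1.0) else -(p * math.log(p) + (1 - p) * math.log(1 - p))
    # Evidence term: fewer pseudo-observations -> larger epistemic uncertainty.
    evidence = 1.0 / (alpha + beta)
    return h * evidence
```

Under this sketch, a datapoint with a 50% estimated success rate but few rollouts outranks one with the same rate and many rollouts, which is exactly the distinction a difficulty-only heuristic misses.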