KnowRL: Improving Large Language Model Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate this sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware KP subsets for training. We further identify a pruning interaction paradox (removing one KP may help, while removing several such KPs together can hurt) and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches an average accuracy of 70.08, already surpassing Nemotron-1.5B by 9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
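To make the pruning interaction paradox concrete, the following is a minimal illustrative sketch, not the CSS procedure used by KnowRL. It assumes a hypothetical scoring function that maps a KP subset to an accuracy, and the toy accuracies in `TOY_ACCURACY` are invented purely for illustration. The sketch contrasts naive independent pruning (drop every KP whose individual removal does not hurt) with an exhaustive search for the smallest subset that matches the best observed score.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable

Subset = FrozenSet[str]

# Hypothetical toy accuracies for three KPs (invented for illustration only).
# Removing A alone or B alone helps; removing both A and B hurts.
TOY_ACCURACY = {
    frozenset(): 0.30,
    frozenset({"A"}): 0.35,
    frozenset({"B"}): 0.35,
    frozenset({"C"}): 0.40,
    frozenset({"A", "B"}): 0.55,
    frozenset({"A", "C"}): 0.65,
    frozenset({"B", "C"}): 0.65,
    frozenset({"A", "B", "C"}): 0.60,
}


def toy_score(subset: Subset) -> float:
    """Stand-in for an expensive evaluation of a model prompted with `subset`."""
    return TOY_ACCURACY[subset]


def prune_independently(kps: Iterable[str], score: Callable[[Subset], float]) -> Subset:
    """Drop every KP whose removal, tested one at a time, does not lower the score."""
    full = frozenset(kps)
    base = score(full)
    removable = {kp for kp in full if score(full - {kp}) >= base}
    return full - removable


def smallest_sufficient_subset(
    kps: Iterable[str],
    score: Callable[[Subset], float],
    tolerance: float = 0.0,
) -> Subset:
    """Exhaustively find the smallest subset scoring within `tolerance` of the best."""
    items = sorted(kps)
    candidates = [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]
    scores = {subset: score(subset) for subset in candidates}
    best = max(scores.values())
    sufficient = [s for s in candidates if scores[s] >= best - tolerance]
    # Prefer fewer KPs; among equally small subsets, prefer the higher score.
    return min(sufficient, key=lambda s: (len(s), -scores[s]))


if __name__ == "__main__":
    naive = prune_independently("ABC", toy_score)
    searched = smallest_sufficient_subset("ABC", toy_score)
    print("independent pruning:", sorted(naive), toy_score(naive))
    print("subset search:      ", sorted(searched), toy_score(searched))
```

On this toy table, independent pruning removes both A and B and ends up below the full-set score, whereas the joint search keeps a two-KP subset that beats the full set. The exhaustive search is exponential in the number of KPs, so it only serves to illustrate why subset selection needs to account for interactions between KPs.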