KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
April 14, 2026
Authors: Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox: removing one KP may help, while removing multiple such KPs can hurt. We explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by 9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
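The pruning interaction paradox and the idea of a size-constrained subset search can be illustrated with a toy example. The sketch below is not the paper's CSS algorithm, which the abstract does not specify. The KP names, the `toy_pass_rate` scores, and the exhaustive search under a size cap are all assumptions made purely for illustration. It models two partially redundant KPs (`kp_a`, `kp_b`): dropping either one alone raises the score, but dropping both lowers it.

```python
from itertools import combinations

# Hypothetical knowledge points (KPs) for a single problem; names are illustrative.
KPS = ("kp_a", "kp_b", "kp_c")


def toy_pass_rate(subset):
    """Made-up stand-in for a measured rollout pass rate given a hint subset."""
    s = frozenset(subset)
    rate = 0.20  # baseline with no hints
    if "kp_c" in s:
        rate += 0.30  # kp_c is independently useful
    has_a, has_b = "kp_a" in s, "kp_b" in s
    if has_a != has_b:
        rate += 0.30  # exactly one of the redundant pair: full benefit
    elif has_a and has_b:
        rate += 0.20  # both present: redundancy costs a little
    return round(rate, 2)


def leave_one_out_gains(kps, score):
    """Score change from removing each KP individually from the full set."""
    full = score(kps)
    return {kp: round(score([k for k in kps if k != kp]) - full, 2) for kp in kps}


def constrained_subset_search(kps, score, max_size):
    """Exhaustive search over subsets of at most max_size KPs (assumed constraint).

    Ties are broken in favor of the smaller, earlier-enumerated subset.
    """
    best_subset, best_score = (), score(())
    for size in range(1, max_size + 1):
        for subset in combinations(kps, size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


if __name__ == "__main__":
    print(leave_one_out_gains(KPS, toy_pass_rate))
    print(toy_pass_rate(["kp_c"]), "vs full set", toy_pass_rate(KPS))
    print(constrained_subset_search(KPS, toy_pass_rate, max_size=2))
```

In this toy setting, removing `kp_a` or `kp_b` alone each improves the score over the full set, so a pruner that drops every individually removable KP would keep only `kp_c` and end up below the full set. Scoring subsets jointly avoids that, which is the intuition behind interaction-aware selection. Exhaustive enumeration is only feasible for a handful of KPs, so a practical method would need a cheaper search.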