KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox (removing one KP may help, while removing several such KPs together can hurt) and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches an average accuracy of 70.08, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.

