KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox (removing one KP may help, while removing multiple such KPs can hurt) and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
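
The pruning interaction paradox mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' CSS implementation: the KP names, the score table, and the helper functions `independent_pruning` and `best_subset` are all hypothetical, chosen only to show why pruning decisions made one KP at a time can fail when applied together, and why searching over whole subsets under a size constraint avoids that failure.

```python
from itertools import combinations

# Hypothetical toy scores (assumed, not from the paper): accuracy obtained
# when the model is prompted with a given subset of knowledge points.
TOY_SCORES = {
    frozenset({"A", "B", "C"}): 0.60,
    frozenset({"B", "C"}): 0.66,  # removing A alone helps
    frozenset({"A", "C"}): 0.65,  # removing B alone helps
    frozenset({"A", "B"}): 0.50,
    frozenset({"C"}): 0.40,       # removing A and B together hurts
    frozenset({"A"}): 0.45,
    frozenset({"B"}): 0.44,
    frozenset(): 0.30,
}


def score(subset):
    """Look up the toy accuracy for a KP subset."""
    return TOY_SCORES[frozenset(subset)]


def independent_pruning(kps):
    """Drop, all at once, every KP whose individual removal raises the score."""
    full = frozenset(kps)
    base = score(full)
    harmful = {kp for kp in full if score(full - {kp}) > base}
    return full - harmful


def best_subset(kps, max_size):
    """Score every subset with at most max_size KPs and return the best one."""
    candidates = (
        frozenset(c)
        for k in range(max_size + 1)
        for c in combinations(sorted(kps), k)
    )
    return max(candidates, key=score)


kps = {"A", "B", "C"}
naive = independent_pruning(kps)        # frozenset({'C'}), score 0.40
joint = best_subset(kps, max_size=2)    # frozenset({'B', 'C'}), score 0.66
print(sorted(naive), score(naive))
print(sorted(joint), score(joint))
```

In this toy table, removing A or B individually improves on the full set (0.60), yet removing both drops the score to 0.40, whereas the joint search keeps B and C and reaches 0.66. Exhaustive enumeration is used here only because the example has three KPs; it says nothing about how the paper's search is actually carried out.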
