Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

Abstract

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards. Fact-level train-test deduplication ensures that gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training-time and inference-time baselines. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
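To make the quantities in the abstract concrete, the following is a minimal sketch of a binary correctness reward, of how a question might be bucketed by whether its answer ever appears among k pre-RL samples, and of how a relative gain is computed. The names (`Example`, `binary_reward`, `bucket`, `relative_gain`) and the normalized exact-match criterion are illustrative assumptions; the abstract does not specify how correctness is judged, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One closed-book QA item with pre-RL model outputs (hypothetical schema)."""
    question: str
    gold: str
    samples: List[str]  # k sampled answers before RL (k = 128 in the abstract)
    greedy: str         # greedy-decoded answer before RL


def normalize(answer: str) -> str:
    # Assumption: correctness is judged by case- and whitespace-insensitive exact match.
    return " ".join(answer.lower().split())


def binary_reward(prediction: str, gold: str) -> float:
    """Return 1.0 if the prediction matches the gold answer, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def sample_hit_rate(ex: Example) -> float:
    """Fraction of the pre-RL samples that are correct."""
    return sum(binary_reward(s, ex.gold) for s in ex.samples) / len(ex.samples)


def bucket(ex: Example) -> str:
    """Classify an item by where its correct answer sits before RL."""
    if sample_hit_rate(ex) == 0.0:
        # Checked first: the correct answer never appears among the samples.
        return "never_sampled"
    if binary_reward(ex.greedy, ex.gold) == 1.0:
        return "greedy_correct"
    # Correct answer is reachable by sampling but not by greedy decoding.
    return "tail_only"


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before` (e.g. accuracy)."""
    return (after - before) / before
```

Under these assumptions, an item in the `tail_only` bucket is one where the correct answer already has probability mass but greedy decoding misses it, and `never_sampled` corresponds to the hardest examples described above. As a purely illustrative calculation with made-up numbers, `relative_gain(0.30, 0.381)` is about 0.27, that is, a 27% relative gain.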
