

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

May 8, 2026
Authors: Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun
cs.AI

Abstract

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting: no chain-of-thought, training only on binary correctness rewards, and fact-level train-test deduplication to ensure that gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training-time and inference-time baselines. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of the training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and are reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking, rather than acquiring, latent parametric knowledge.
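As a rough illustration of the setup the abstract describes, here is a minimal Python sketch of a binary correctness reward and the 128-sample pre-RL check used to bucket "hard" examples in the data-attribution study. The function names (`normalize`, `binary_reward`, `is_hard_example`, `sample_fn`) and the exact-match normalization are assumptions for illustration; the abstract only specifies a binary correctness reward and that an example counts as hard if its answer never appears in 128 pre-RL samples.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation for lenient answer matching (an assumed metric)."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def binary_reward(generation: str, gold_answers: list[str]) -> float:
    """Binary correctness reward: 1.0 if the generation matches any gold answer, else 0.0."""
    pred = normalize(generation)
    return 1.0 if any(normalize(a) == pred for a in gold_answers) else 0.0

def is_hard_example(sample_fn, prompt: str, gold_answers: list[str], k: int = 128) -> bool:
    """An example is 'hard' if no correct answer appears in k samples from the pre-RL model.

    `sample_fn` is a hypothetical callable that draws one temperature sample
    (a string) from the pre-RL model for the given prompt.
    """
    return all(binary_reward(sample_fn(prompt), gold_answers) == 0.0 for _ in range(k))
```

Under this bucketing, the abstract's claim is that the ~18% of training examples for which `is_hard_example` returns True account for ~83% of the overall gain, because even for these examples rare correct rollouts occasionally emerge during RL training and receive the positive reward.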