Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
May 8, 2026
Authors: Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun
cs.AI
Abstract
Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure that gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training-time and inference-time baselines. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in any of 128 pre-RL samples (only ~18% of the training data) drive ~83% of the gain, because rare correct rollouts still emerge during training and are reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking, rather than acquiring, latent parametric knowledge.
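The setup described above reduces to two concrete components that are easy to make precise: the binary correctness reward used during training, and the 128-sample pre-RL bucketing behind the data-attribution result. Below is a minimal Python sketch of both, assuming a generic SQuAD-style answer normalization and a hypothetical `sample_fn` that draws one completion from the pre-RL model; these details are illustrative assumptions, not the paper's exact recipe.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation (assumed, SQuAD-style normalization)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def binary_reward(prediction: str, gold_answers: list[str]) -> float:
    """Binary correctness reward: 1.0 iff the prediction matches any gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0

def hardness_bucket(question: str, gold_answers: list[str], sample_fn, k: int = 128) -> str:
    """Bucket a training example by whether its answer ever appears in k pre-RL samples.

    `sample_fn(question)` is a hypothetical callable returning one sampled
    completion from the pre-RL model. Examples whose gold answer never appears
    in any of the k samples form the "hardest" slice (~18% of training data)
    that the paper credits with ~83% of the overall gain.
    """
    hits = sum(binary_reward(sample_fn(question), gold_answers) > 0 for _ in range(k))
    return "never-sampled" if hits == 0 else "sampled-at-least-once"

# Example with a stubbed sampler that always answers "paris":
bucket = hardness_bucket("What is the capital of France?", ["Paris"], lambda q: "paris")
print(bucket)  # -> "sampled-at-least-once"
```

The key property this sketch illustrates is that the reward is purely outcome-based: no chain-of-thought is scored, so any improvement the reward can drive must come from the model surfacing an answer it can already generate, which is exactly the redistribution-over-existing-knowledge mechanism the abstract describes.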