

Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs

November 8, 2025
Authors: Renfei Zhang, Manasa Kaniselvan, Niloofar Mireshghallah
cs.AI

Abstract

Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFT models through hierarchical traversal, recovers most of the performance gap (reducing it from 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally, our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement "code 57.95 refers to urinary infection") maintain high cosine similarity between SFT and RL models, query representations (e.g., "what is code 57.95") diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.
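The layer-wise analysis described above compares the cosine similarity of per-layer hidden states for the same prompt run through the SFT and RL models. The sketch below illustrates the metric itself; the random vectors are placeholders standing in for real model activations (the paper's actual extraction pipeline and models are not reproduced here):

```python
import numpy as np

def layerwise_cosine(acts_a, acts_b):
    """Per-layer cosine similarity between two models' activations.

    acts_a, acts_b: lists of 1-D arrays, one hidden-state vector per layer
    (e.g., the hidden state at the final token position of a prompt).
    """
    sims = []
    for a, b in zip(acts_a, acts_b):
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims

# Toy illustration of the paper's observed pattern (synthetic data, not
# real activations): "factual" representations nearly coincide between the
# two models, while "query" representations are unrelated.
rng = np.random.default_rng(0)
n_layers, dim = 4, 64

fact_sft = [rng.normal(size=dim) for _ in range(n_layers)]
fact_rl = [v + 0.01 * rng.normal(size=dim) for v in fact_sft]  # small perturbation
fact_sims = layerwise_cosine(fact_sft, fact_rl)                # all close to 1.0

query_sft = [rng.normal(size=dim) for _ in range(n_layers)]
query_rl = [rng.normal(size=dim) for _ in range(n_layers)]     # independent draws
query_sims = layerwise_cosine(query_sft, query_rl)             # near 0 in expectation
```

In practice the activation lists would come from the hidden states of the two checkpoints on a statement prompt ("code 57.95 refers to ...") versus a query prompt ("what is code 57.95"); high similarity on the former and divergence on the latter is the signature the abstract reports.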