EtCon: Edit-then-Consolidate for Reliable Knowledge Editing
December 4, 2025
Authors: Ruilin Li, Yibin Wang, Wenhong Zhu, Chenglin Li, Jinghao Zhang, Chenliang Li, Junchi Yan, Jiaqi Wang
cs.AI
Abstract
Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts tune the knowledge layers of LLMs and have proved effective at making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. Our empirical analysis reveals two recurring issues behind this gap: (1) most traditional methods lead the edited model to overfit to the new fact, degrading pre-trained capabilities; and (2) the absence of a knowledge consolidation stage leaves new facts insufficiently integrated into the LLM's inference-time behavior under autoregressive generation, producing a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that bridges the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT), which localizes the edit through a trust-region objective that limits policy drift; and (2) a subsequent consolidation stage based on Group Relative Policy Optimization (GRPO) aligns the edited knowledge with the chain-of-thought (CoT) inference policy by optimizing trajectory-level behavior under comprehensive reward signals. Extensive experiments demonstrate that our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.
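The abstract names the two training objectives but does not state them. As a reading aid, a minimal sketch is given below, assuming that TPSFT pairs the supervised cross-entropy on the edited fact with a KL proximity term to the pre-edit policy (one common way to realize a trust-region constraint), and that the consolidation stage uses the standard GRPO objective in a simplified trajectory-level form. All symbols here (the pre-edit model $\pi_{\theta_0}$, group size $G$, outcome rewards $r_i$, coefficients $\beta$ and $\epsilon$) are illustrative assumptions, not the paper's notation.

$$
\mathcal{L}_{\text{TPSFT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y^\ast)}\!\left[\sum_{t}\log \pi_\theta\!\left(y^\ast_t \mid x,\, y^\ast_{<t}\right)\right] \;+\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\theta_0}(\cdot \mid x)\right]
$$

$$
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\mathcal{J}_{\text{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i\,\hat{A}_i,\ \operatorname{clip}\!\big(\rho_i,\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta\,\big\|\,\pi_{\mathrm{ref}}\right],
\quad
\rho_i = \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\text{old}}}(o_i \mid x)}
$$

Under this reading, the KL term plays the trust-region role by keeping the edited policy close to the pre-edit model, while the group-normalized advantages $\hat{A}_i$ let GRPO reward sampled CoT trajectories $o_1,\dots,o_G$ that actually use the edited fact. The "comprehensive reward signals" mentioned in the abstract are not specified there, so $r_i$ is left abstract in this sketch.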