

GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

August 6, 2025
作者: Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
cs.AI

Abstract

The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines on previously learned tasks. To address both issues simultaneously in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state constrained optimization method using a threshold-based margin (TM) loss, which maintains activation-state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former inherently facilitates the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay in future LLMs. Our code and data are available at https://github.com/Qznan/GeRe.
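The abstract does not give the exact formulation of the threshold-based margin (TM) loss, but a hinge-style penalty on activation drift is one plausible reading: deviations of current activation states from their pre-replay reference values are tolerated up to a threshold, and only the excess is penalized. The sketch below illustrates that idea; the function name, the default margin, and the elementwise hinge form are assumptions for illustration, not the authors' implementation.

```python
def threshold_margin_loss(activations, reference, margin=0.1):
    """Hypothetical sketch of a threshold-based margin (TM) penalty.

    Activation drift within `margin` of the reference state is free;
    only the amount exceeding the threshold contributes to the loss,
    encouraging activation-state consistency without forbidding small,
    harmless changes during replay learning.
    """
    excess = [max(abs(a - r) - margin, 0.0)
              for a, r in zip(activations, reference)]
    return sum(excess) / len(excess)


# Example: the first unit drifts by 0.05 (within the margin, no cost),
# the second by 0.5 (0.4 beyond the margin), so the mean penalty is 0.2.
loss = threshold_margin_loss([1.0, 2.0], [1.05, 2.5], margin=0.1)
```

In contrast, the L1/L2 feature-imitation baselines the paper compares against would penalize every deviation proportionally; the margin's tolerance band is what distinguishes the TM formulation in this sketch.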