Learning, Fast and Slow: Towards LLMs That Adapt Continually
May 12, 2026
Authors: Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri
cs.AI
Abstract
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, parameter updates force the model to absorb task-specific information, which can cause catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but by itself typically cannot match the performance gains available from updating LLM parameters. There is no good reason to restrict learning to be either in-context or in-weights; moreover, humans likewise appear to learn at different time scales (e.g., System 1 vs. System 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. The fast "weights" learn from textual feedback to absorb task-specific information, while the slow weights stay closer to the base model and preserve general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than slow-only learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
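The division of labor described above can be illustrated with a toy sketch. This is not the paper's actual implementation; every name below is hypothetical, and real FST operates on an LLM, not scalars. The sketch only shows the two update paths: "fast weights" as an in-context prompt revised from textual feedback (no parameter change), and "slow weights" as parameters nudged by small, regularized steps that pull them back toward the frozen base model (the source of the reduced-KL-drift property).

```python
# Illustrative sketch only: scalar stand-ins for model parameters,
# a list of strings standing in for the optimized context.
from dataclasses import dataclass, field


@dataclass
class FastSlowLearner:
    slow_weights: float = 0.0   # stand-in for trainable LLM parameters
    base_weights: float = 0.0   # frozen reference (the base model)
    fast_context: list = field(default_factory=list)  # optimized prompt

    def fast_update(self, feedback: str) -> None:
        # Fast path: absorb task-specific information as text.
        # Cheap, immediate, and leaves the parameters untouched.
        self.fast_context.append(feedback)

    def slow_update(self, gradient: float,
                    lr: float = 0.1, kl_penalty: float = 0.5) -> None:
        # Slow path: a small step with a pull-back term toward the base
        # model, so the parameters drift less from their starting point.
        drift = self.slow_weights - self.base_weights
        self.slow_weights += lr * (gradient - kl_penalty * drift)


learner = FastSlowLearner()
learner.fast_update("use step-by-step reasoning for arithmetic")
for _ in range(10):
    learner.slow_update(gradient=1.0)

# The penalty caps parameter drift: with gradient 1.0 and kl_penalty 0.5,
# slow_weights converges toward 2.0 instead of growing without bound.
print(learner.fast_context, learner.slow_weights)
```

Note the design point the sketch makes concrete: the fast path can take in arbitrary task-specific detail instantly, while the regularized slow path bounds how far the parameters move, which is what preserves plasticity for later tasks.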