Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Although existing models show promising results on tasks that require instant prediction or in-context learning, they lack the ability to learn continually and to effectively transfer their temporal in-context knowledge into their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to learn continually, distill their fragile short-term memories into stable long-term knowledge through replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, in which the memories of a smaller self are distilled into a larger network, providing more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase in which the model uses RL to generate a curriculum of synthetic data, rehearsing new knowledge and refining existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.