Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

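Abstract

Over the past few decades, the design of machine learning algorithms has advanced significantly, from early shallow models built for specific tasks to more recent, more general deep Large Language Models (LLMs). Although these models show promise on tasks that require instant prediction or in-context learning, they lack the ability to learn continually and cannot effectively transfer their temporal in-context knowledge into their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that lets models learn continually, distill their fragile short-term memories into stable long-term knowledge through replay, and recursively improve themselves through a "Dreaming" process. Specifically, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, that distills the memories of a smaller self into a larger network, providing more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process to implement Knowledge Seeding (i.e., a combination of on-policy distillation and Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement stage in which the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.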

English

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to learn continually and to effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to learn continually, distill their fragile short-term memories into stable long-term knowledge through replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, in which the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase in which the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.