Learning, Fast and Slow: Towards LLMs That Adapt Continually

Abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via reinforcement learning, RL). However, parameter updates force the model to absorb task-specific information, which can cause catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters (e.g., prompt optimization) can adapt to task-specific requirements cheaply and rapidly, but on its own it typically cannot match the performance gains available through parameter updates. There is no good reason to restrict learning to either in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs. System 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb task-specific information, allowing the slow weights to stay closer to the base model and preserve general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than slow learning alone (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than models trained through parameter updates alone. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
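The abstract reports drift as KL divergence from the base LLM but does not say how it is estimated. As a minimal sketch, assuming drift is measured as the mean per-token KL(trained || base) over next-token distributions, the comparison could look like the following. The function names (`kl_divergence`, `mean_drift`) and the three-token distributions are hypothetical and purely illustrative; they are not results or code from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_drift(trained_dists, base_dists):
    """Average per-position KL(trained || base) over a sequence of token positions."""
    pairs = list(zip(trained_dists, base_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Made-up next-token distributions over a 3-token vocabulary at two positions.
# (Assumption: illustrative numbers only, not measurements.)
base = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
rl_only = [[0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]   # parameters moved far from the base model
fst = [[0.6, 0.25, 0.15], [0.5, 0.35, 0.15]]     # parameters stayed close to the base model

drift_rl = mean_drift(rl_only, base)
drift_fst = mean_drift(fst, base)

# Relative reduction in drift, the kind of quantity behind a "70% less KL" statement.
reduction = 1.0 - drift_fst / drift_rl
print(f"RL-only drift: {drift_rl:.4f} nats")
print(f"FST drift:     {drift_fst:.4f} nats")
print(f"Relative reduction: {reduction:.1%}")
```

In practice the distributions would come from the softmax outputs of the base and trained models on the same prompts; the direction of the KL term and the averaging scheme are assumptions here.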
