Learning, Fast and Slow: Towards LLMs That Adapt Continually

Abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters (e.g., prompt optimization) can adapt cheaply and rapidly to task-specific requirements, but by itself it typically cannot match the performance gains available through updating LLM parameters. There is no good reason to restrict learning to either in-context or in-weights alone. Moreover, humans also likely learn at different time scales (e.g., System 1 vs. System 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" learn from textual feedback to absorb the task-specific information, which allows the slow weights to stay closer to the base model and retain general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than slow learning alone (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than models trained on parameters only. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
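The drift claim above is stated in terms of KL divergence from the base model. As a rough illustration only, the sketch below computes KL(p || q) for two discrete distributions and compares a small and a large shift away from a base distribution. The three-token distributions, the variable names, and the direction KL(trained || base) are assumptions made for this example; the abstract does not specify how the divergence is estimated.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a term with p_i = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical next-token distributions over a 3-token vocabulary (illustrative only).
base = [0.5, 0.3, 0.2]          # stands in for the base model
small_drift = [0.45, 0.35, 0.20]  # stands in for a model that stayed close to the base
large_drift = [0.10, 0.10, 0.80]  # stands in for a model that moved far from the base

kl_small = kl_divergence(small_drift, base)
kl_large = kl_divergence(large_drift, base)

# Relative reduction in drift of the first model versus the second.
relative_reduction = 1.0 - kl_small / kl_large
```

In this toy setting a lower value means the trained distribution stays closer to the base one, which is the sense in which "less KL divergence" is read here.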
