Language Models Need Sleep

Abstract

Transformer-based large language models are increasingly used for long-horizon tasks, but their attention mechanism scales poorly with context length. To address this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights and then clears its key-value cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. At inference time, this shifts the extra computation into the sleep phase, so the latency of wake-time prediction is preserved. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which both a standard transformer and SSM-attention hybrid models fail. We then show that increasing the sleep duration N improves our models' performance, with the largest gains on examples that require deeper reasoning.
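
The sleep phase described in the abstract can be illustrated with a toy sketch. The abstract does not specify the learned local rule, so the code below substitutes a fixed, hand-written delta rule as a stand-in, and it models the fast weights as a single small matrix rather than SSM blocks. All names (`sleep`, `delta_update`, `recall_error`, `run`) and the learning rate are hypothetical and chosen only for illustration.

```python
# Toy sketch only: the delta rule below is a hand-written stand-in for the
# paper's *learned* local rule, and every name here is hypothetical.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def delta_update(W, key, value, lr):
    """Local update: nudge W in place so that W @ key moves toward value."""
    pred = matvec(W, key)
    for i, row in enumerate(W):
        err = value[i] - pred[i]
        for j in range(len(row)):
            row[j] += lr * err * key[j]

def sleep(W, kv_cache, n_passes, lr=0.5):
    """Run n_passes offline passes over the cached context, then clear it."""
    for _ in range(n_passes):
        for key, value in kv_cache:
            delta_update(W, key, value, lr)
    kv_cache.clear()  # the wake phase starts again with an empty cache

def recall_error(W, pairs):
    """Sum of squared errors when reading each value back from W alone."""
    return sum(
        (v_i - p_i) ** 2
        for key, value in pairs
        for v_i, p_i in zip(value, matvec(W, key))
    )

def run(n_passes):
    """Consolidate two cached (key, value) pairs and report the result."""
    pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.6, 0.8], [1.0, 0.0])]
    W = [[0.0, 0.0], [0.0, 0.0]]  # fast weights start empty
    cache = list(pairs)           # stands in for the key-value cache
    sleep(W, cache, n_passes)
    return recall_error(W, pairs), len(cache)
```

Calling `run(1)` and `run(8)` returns the recall error and the remaining cache size for a short and a long sleep. In this toy setting, more passes leave a smaller error while the cache is empty either way, which only loosely mirrors the role of N in the abstract and says nothing about the reported results.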