End-to-End Test-Time Training for Long Context

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture: a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. At the same time, like RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
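The core idea of learning at test time by next-token prediction on the context can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a bigram logit table in place of the sliding-window Transformer, uses plain gradient descent, and omits the meta-learned initialization entirely. The names `ttt_read_context` and `chunk_loss_and_grad`, the chunk size, and the learning rate are all hypothetical choices made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_loss_and_grad(W, tokens):
    """Mean next-token cross-entropy of bigram table W on `tokens`, and its gradient."""
    vocab = len(W)
    n = len(tokens) - 1  # number of (previous, next) pairs in this chunk
    grad = [[0.0] * vocab for _ in range(vocab)]
    loss = 0.0
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        probs = softmax(W[prev])
        loss -= math.log(probs[nxt]) / n
        for j in range(vocab):
            target = 1.0 if j == nxt else 0.0
            grad[prev][j] += (probs[j] - target) / n
    return loss, grad

def ttt_read_context(W0, context, chunk=16, lr=1.0):
    """Read the context chunk by chunk, taking one gradient step per chunk.

    Each chunk is scored BEFORE the update, so `losses` measures how well the
    weights learned from earlier context predict the text that follows.
    """
    W = [row[:] for row in W0]  # copy: the initialization itself is left untouched
    losses = []
    for start in range(0, len(context) - 1, chunk):
        # Overlap by one token so every transition is seen exactly once.
        piece = context[start:start + chunk + 1]
        loss, grad = chunk_loss_and_grad(W, piece)
        losses.append(loss)
        for i, grad_row in enumerate(grad):
            for j, g in enumerate(grad_row):
                W[i][j] -= lr * g  # the context is compressed into the weights
    return W, losses

# Toy context: a repeating cycle over 8 token ids (an assumption for the demo).
vocab = 8
context = [i % vocab for i in range(512)]
W0 = [[0.0] * vocab for _ in range(vocab)]  # uniform predictions at the start
W_final, losses = ttt_read_context(W0, context)
print(f"first chunk loss: {losses[0]:.3f}, last chunk loss: {losses[-1]:.3f}")
```

With the all-zero initialization, the first chunk is predicted uniformly, so its loss is log 8. Later chunks are scored by weights that have already absorbed earlier context, so their loss is lower. The memory used stays fixed at the size of `W` regardless of how much context has been read, which is the property the abstract contrasts with full attention.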
