End-to-End Test-Time Training for Long Context
December 29, 2025
Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
cs.AI
Abstract
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
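To make the test-time side of the abstract concrete, below is a minimal sketch of test-time training via next-token prediction: the model takes gradient steps on chunks of the given context, so that what it reads is compressed into its weights. This is not the authors' released code; the TinyCausalLM stand-in (in place of a Transformer with sliding-window attention), the test_time_train helper, and the chunk size, optimizer, and learning rate are all illustrative assumptions.

```python
# Minimal TTT sketch (illustrative, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Toy stand-in for a Transformer with sliding-window attention."""
    def __init__(self, vocab_size=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.Linear(d_model, d_model)   # placeholder for attention/MLP blocks
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq)
        h = torch.tanh(self.block(self.embed(tokens)))
        return self.head(h)                        # logits: (batch, seq, vocab)

def test_time_train(model, context, chunk_size=512, lr=1e-3, steps_per_chunk=1):
    """Continue learning on the test context via next-token prediction,
    updating the weights chunk by chunk (hyperparameters are assumptions)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context.size(1) - 1, chunk_size):
        chunk = context[:, start:start + chunk_size + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        for _ in range(steps_per_chunk):
            logits = model(inputs)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

if __name__ == "__main__":
    model = TinyCausalLM()
    long_context = torch.randint(0, 256, (1, 4096))  # stand-in for a long (e.g. 128K) context
    test_time_train(model, long_context)             # weights now encode the context read so far
```

In the paper's full method, the initialization from which this test-time learning starts is itself optimized by meta-learning at training time; that outer loop is omitted from this sketch.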