End-to-End Test-Time Training for Long Context

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture: a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
