Test-Time Training with KV Binding Is Secretly Linear Attention

Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields several practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
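To make the claimed correspondence concrete, the following is a minimal sketch of the simplest special case, not the paper's general derivation. It assumes a linear inner model f_W(k) = W k initialized at zero, one gradient step per token, and a dot-product inner loss -<v, W k>; the function names, the learning rate `lr`, and the random inputs are all hypothetical. Under these assumptions the gradient with respect to W is -v k^T, so each TTT update adds lr * v k^T to the fast weights, and reading out with a query gives the same result as unnormalized causal linear attention.

```python
import random


def outer(v, k):
    """Outer product v k^T as a nested list of shape (len(v), len(k))."""
    return [[vi * kj for kj in k] for vi in v]


def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ttt_linear_dot_loss(keys, values, queries, lr=1.0):
    """Sequential view: one gradient step per token on the assumed loss -<v, W k>."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # assumed zero initialization
    outputs = []
    for k, v, q in zip(keys, values, queries):
        grad = [[-g for g in row] for row in outer(v, k)]  # dL/dW = -v k^T
        W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
        outputs.append(matvec(W, q))  # read out with the updated weights
    return outputs


def causal_linear_attention(keys, values, queries, lr=1.0):
    """Attention view: o_t = lr * sum_{s<=t} (k_s . q_t) v_s, no softmax, no normalization."""
    outputs = []
    for t, q in enumerate(queries):
        o = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = lr * dot(keys[s], q)
            o = [oi + weight * vi for oi, vi in zip(o, values[s])]
        outputs.append(o)
    return outputs


# Hypothetical random inputs: 5 tokens, key/query dimension 4, value dimension 3.
random.seed(0)
T, d_k, d_v = 5, 4, 3
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
values = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
queries = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]

ttt_out = ttt_linear_dot_loss(keys, values, queries, lr=0.5)
attn_out = causal_linear_attention(keys, values, queries, lr=0.5)

# Largest elementwise gap between the two views; expected to be at rounding level.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(ttt_out, attn_out)
               for a, b in zip(row_a, row_b))
```

Other inner losses or nonlinear inner models do not collapse this directly; for example, a squared-error inner loss with the same linear model yields a delta-rule style update rather than a plain sum. The abstract's claim covers a broader class of architectures than this sketch does.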
