Test-Time Training with KV Binding Is Secretly Linear Attention

Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
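The abstract does not give the derivation behind this claim, so the following is only a minimal sketch of the simplest special case, not the paper's construction. It assumes a linear inner model with zero-initialized fast weights W, a dot-product inner loss L(W) = -<W k, v>, and one gradient step per token with a fixed learning rate; these choices and all function names are made up for this illustration. Under those assumptions each gradient step adds lr * v k^T to W, so reading out W q gives the same result as unnormalized causal linear attention.

```python
import random


def ttt_linear_inner_loop(qs, ks, vs, lr=1.0):
    """Sequential view: one gradient step per token on fast weights W."""
    d_k, d_v = len(ks[0]), len(vs[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # assumption: zero-initialized fast weights
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Assumed inner loss: L(W) = -<W k, v>, hence dL/dW = -v k^T.
        # One gradient-descent step therefore adds lr * v k^T to W.
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += lr * v[i] * k[j]
        # Read out with the updated fast weights: o = W q.
        outputs.append(
            [sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
        )
    return outputs


def causal_linear_attention(qs, ks, vs, lr=1.0):
    """Attention view: o_t = lr * sum over i <= t of (k_i . q_t) * v_i."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        o = [0.0] * d_v
        for k, v in zip(ks[: t + 1], vs[: t + 1]):
            score = lr * sum(a * b for a, b in zip(k, q))
            for i in range(d_v):
                o[i] += score * v[i]
        outputs.append(o)
    return outputs


def max_abs_diff(a, b):
    """Largest elementwise gap between two lists of output vectors."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Small random example; sizes and seed are arbitrary.
rng = random.Random(0)
T, d_k, d_v = 6, 4, 3
qs = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
ks = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
vs = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]

diff = max_abs_diff(
    ttt_linear_inner_loop(qs, ks, vs, lr=0.5),
    causal_linear_attention(qs, ks, vs, lr=0.5),
)
print("max abs difference:", diff)
```

The two functions should agree up to floating-point rounding. Other inner losses or nonlinear inner models would change the update rule, and this sketch makes no claim about how the paper handles those cases.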

