End-to-End Test-Time Training for Long Context

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture: a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while other methods, such as Mamba 2 and Gated DeltaNet, do not. At the same time, like RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention at 128K context. Our code is publicly available.
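The test-time half of this idea, taking gradient steps on next-token prediction over the context so that the context ends up stored in the weights, can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration: the model is a bigram logit table rather than the Transformer with sliding-window attention used in the paper, the names `adapt_on_context` and `next_token_loss`, the chunk size, and the learning rate are hypothetical, and the meta-learned initialization is omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(weights, tokens):
    """Mean cross-entropy of predicting tokens[i + 1] from tokens[i]."""
    pairs = list(zip(tokens, tokens[1:]))
    return sum(-math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)


def adapt_on_context(weights, context, chunk_size=4, lr=0.5):
    """Take one gradient step per context chunk; return the adapted copy."""
    vocab = len(weights)
    fast = [row[:] for row in weights]  # leave the initial weights untouched
    for start in range(0, len(context) - 1, chunk_size):
        # Overlap by one token so the pair spanning the chunk boundary is kept.
        chunk = context[start:start + chunk_size + 1]
        pairs = list(zip(chunk, chunk[1:]))
        grad = [[0.0] * vocab for _ in range(vocab)]
        for prev, nxt in pairs:
            probs = softmax(fast[prev])
            for j in range(vocab):
                # d(cross-entropy)/d(logit_j) = p_j - 1[j == nxt]
                grad[prev][j] += (probs[j] - (1.0 if j == nxt else 0.0)) / len(pairs)
        for i in range(vocab):
            for j in range(vocab):
                fast[i][j] -= lr * grad[i][j]
    return fast


init = [[0.0] * 3 for _ in range(3)]  # uniform predictions before adaptation
context = [0, 1, 2] * 8               # a repeating pattern to absorb
adapted = adapt_on_context(init, context)
print(next_token_loss(init, context), next_token_loss(adapted, context))
```

In this toy setting the adapted weights predict the repeating pattern better than the initial ones, and the cost of each prediction does not depend on how much context has been read, which is the property the abstract attributes to TTT-E2E. It says nothing about how the actual method scales.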
