Learning to (Learn at Test Time): RNNs with Expressive Hidden States
July 5, 2024
Authors: Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
cs.AI
Abstract
Self-attention performs well in long context but has quadratic complexity.
Existing RNN layers have linear complexity, but their performance in long
context is limited by the expressive power of their hidden state. We propose a
new class of sequence modeling layers with linear complexity and an expressive
hidden state. The key idea is to make the hidden state a machine learning model
itself, and the update rule a step of self-supervised learning. Since the
hidden state is updated by training even on test sequences, our layers are
called Test-Time Training (TTT) layers. We consider two instantiations:
TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer
MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B
parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both
TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer,
they can keep reducing perplexity by conditioning on more tokens, while Mamba
cannot after 16k context. With preliminary systems optimization, TTT-Linear is
already faster than Transformer at 8k context and matches Mamba in wall-clock
time. TTT-MLP still faces challenges in memory I/O, but shows larger potential
in long context, pointing to a promising direction for future research.
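The abstract's key idea can be made concrete with a small sketch. The following Python/NumPy snippet is a minimal illustration of a TTT-Linear-style layer under stated assumptions, not the authors' implementation: it assumes fixed projection matrices theta_K, theta_V, theta_Q (which the paper learns in the outer loop), a hand-picked learning rate, and strictly token-by-token updates, whereas the paper also develops mini-batch and dual-form variants for hardware efficiency. The hidden state W is itself a linear model; at every token it takes one gradient step on a self-supervised reconstruction loss, and the layer's output is the updated model applied to a view of the current token.

```python
import numpy as np

def ttt_linear_layer(x, theta_K, theta_V, theta_Q, lr=0.1):
    """
    Minimal sketch of a TTT-Linear-style layer (naive, one token at a time).

    x                  : (T, d) input token sequence
    theta_K/V/Q        : (d, d) projection matrices (learned by the outer loop
                         in the paper; random placeholders in this sketch)
    The hidden state W is itself a linear model, updated at every token by one
    gradient step on a self-supervised reconstruction loss.
    """
    T, d = x.shape
    W = np.zeros((d, d))          # hidden state: weights of a linear model
    outputs = np.zeros_like(x)

    for t in range(T):
        k = theta_K @ x[t]        # "training view" of the current token
        v = theta_V @ x[t]        # reconstruction target
        q = theta_Q @ x[t]        # "test view" used to produce the output

        # self-supervised loss ||W k - v||^2; take one gradient step on W
        err = W @ k - v
        grad = 2.0 * np.outer(err, k)
        W = W - lr * grad         # update rule = one step of test-time training

        outputs[t] = W @ q        # output rule: apply the updated hidden state

    return outputs

# Usage example: random projections stand in for the learned outer-loop parameters.
rng = np.random.default_rng(0)
d, T = 16, 32
x = rng.standard_normal((T, d))
thetas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
y = ttt_linear_layer(x, *thetas)
print(y.shape)  # (32, 16)
```

Because the hidden state is updated by a gradient step rather than a fixed linear recurrence, its capacity grows with the complexity of the inner model (a two-layer MLP in TTT-MLP instead of the matrix W above), while the per-token cost stays linear in sequence length.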