Learning to (Learn at Test Time): RNNs with Expressive Hidden States

July 5, 2024
Authors: Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
cs.AI

Abstract

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
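
To make the abstract's key idea concrete, below is a minimal sketch of the per-token (online) TTT-Linear update: the hidden state is the weight matrix of a linear model, each token triggers one gradient step on a self-supervised reconstruction loss, and the output is the updated model applied to a query view of the token. The function name `ttt_linear_forward`, the learning rate `eta`, and the toy dimensions are illustrative assumptions; the paper's full layer additionally uses mini-batch TTT, normalization, and residual connections inside the inner model, all omitted here.

```python
import torch

def ttt_linear_forward(x, theta_K, theta_V, theta_Q, eta=0.01):
    """Sketch of an online TTT-Linear forward pass over one sequence.

    x: (T, d) token embeddings.
    theta_K / theta_V / theta_Q: (d, d) learned projections (outer-loop
    parameters, trained as usual with the rest of the network).
    The hidden state W is itself a linear model, updated at test time by
    one gradient step of a self-supervised loss per token (inner loop).
    """
    T, d = x.shape
    W = torch.zeros(d, d)           # hidden state: weights of a linear model
    outputs = []
    for t in range(T):
        k = x[t] @ theta_K          # "training view" of the token
        v = x[t] @ theta_V          # self-supervised reconstruction target
        # Inner-loop loss l(W) = ||k @ W - v||^2; its gradient w.r.t. W
        # is 2 * outer(k, k @ W - v).
        grad = 2.0 * torch.outer(k, k @ W - v)
        W = W - eta * grad          # update rule = one step of SGD
        q = x[t] @ theta_Q          # "test view" of the token
        outputs.append(q @ W)       # output rule: apply the updated model
    return torch.stack(outputs)    # (T, d)

# Toy usage with hypothetical dimensions.
d, T = 16, 32
x = torch.randn(T, d)
theta = [torch.randn(d, d) / d**0.5 for _ in range(3)]
y = ttt_linear_forward(x, *theta)
print(y.shape)  # torch.Size([32, 16])
```

Because each token costs one fixed-size gradient step and the state W has constant size, the pass is linear in sequence length, in contrast to the quadratic cost of self-attention; swapping the linear model for a two-layer MLP gives the TTT-MLP variant.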
