
End-to-End Test-Time Training for Long Context

December 29, 2025
Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
cs.AI

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
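The following is a minimal sketch of the test-time learning loop described in the abstract: the model keeps training via next-token prediction on the context it is given, compressing what it reads into its weights. It assumes a Hugging Face-style causal language model with sliding-window attention; the chunk size, optimizer, learning rate, and single gradient step per chunk are illustrative assumptions, not the authors' actual settings.

```python
# Hedged sketch of test-time training (TTT) via next-token prediction.
# Assumptions: `model` is a causal LM whose forward pass returns `.logits`
# and whose attention is sliding-window; hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk_size=2048, lr=1e-4):
    """Continue next-token-prediction training on the given context,
    so the context is absorbed into the model's weights."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        # Take a chunk plus one extra token so inputs/targets are shifted by one.
        chunk = context_ids[:, start:start + chunk_size + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits = model(inputs).logits  # sliding-window attention inside the model
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # one inner-loop update per chunk (illustrative choice)
    model.eval()
    return model
```

In the paper's framing, the initialization that this inner loop starts from is itself optimized by meta-learning at training time, so that learning at test time is effective end to end; that outer loop is not shown in this sketch.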