In-Place Test-Time Training
April 7, 2026
Authors: Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai
cs.AI
Abstract
The static "train-then-deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers, including architectural incompatibility, computational inefficiency, and fast-weight objectives misaligned with language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the next-token-prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, yields a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k tokens, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation studies further provide deeper insights into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
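To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's exact algorithm or objective) of chunk-wise test-time training: an MLP output-projection matrix `W` is treated as fast weights and updated with a gradient step after each chunk, here using a simple squared-error loss as a stand-in for the paper's next-token-prediction-aligned objective. The function name `inplace_ttt_chunkwise` and all shapes are hypothetical.

```python
import numpy as np

def inplace_ttt_chunkwise(hidden, targets, W, lr=0.1, chunk=4):
    """Illustrative chunk-wise fast-weight update (a sketch, not the
    paper's method): forward each chunk through the current projection W,
    then take one gradient step on 0.5 * MSE as a proxy objective."""
    W = W.copy()                       # fast weights, adapted at "test time"
    outputs = []
    for s in range(0, len(hidden), chunk):
        h = hidden[s:s + chunk]        # (chunk, d_in) MLP hidden states
        y = targets[s:s + chunk]       # (chunk, d_out) supervision signal
        pred = h @ W                   # forward pass with current fast weights
        outputs.append(pred)
        grad = h.T @ (pred - y) / len(h)   # gradient of 0.5*MSE w.r.t. W
        W -= lr * grad                 # in-place fast-weight update
    return np.vstack(outputs), W
```

Because each chunk is processed with weights updated only by earlier chunks, predictions on later chunks improve as the sequence streams in, which is the behavior the chunk-wise mechanism is designed to exploit.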