In-Place Test-Time Training

Abstract

The static "train then deploy" paradigm prevents Large Language Models (LLMs) from dynamically adapting their weights in response to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by three critical barriers: architectural incompatibility, computational inefficiency, and fast-weight objectives that are misaligned with language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the Next-Token-Prediction task that governs autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, yields a highly scalable algorithm that is compatible with context parallelism. Extensive experiments validate the framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts of up to 128k tokens, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation studies provide further insight into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
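To make the chunk-wise fast-weight idea concrete, the following is a minimal toy sketch, not the method from the paper. The abstract does not specify the objective or the update rule, so this sketch assumes a plain squared-error loss against a caller-supplied stand-in target and a single averaged gradient step per chunk. The function name `chunkwise_fast_weight_update`, the parameters `chunk_size` and `lr`, and the ReLU MLP are all hypothetical choices made for illustration. What it does reflect from the abstract is the structure: only the final (down) projection of the MLP is adapted, and updates happen once per chunk rather than once per token.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def chunkwise_fast_weight_update(xs, targets, W_up, W_down, chunk_size=4, lr=0.1):
    """Toy chunk-wise test-time update of an MLP's final projection.

    xs, targets: lists of vectors (one per position). The targets are a
    stand-in for whatever signal the real objective uses (assumption).
    W_up stays frozen; a copy of W_down acts as the fast weights.
    Returns (outputs, adapted_W_down).
    """
    W_fast = [row[:] for row in W_down]  # leave the caller's weights untouched
    outputs = []
    for start in range(0, len(xs), chunk_size):
        chunk_x = xs[start:start + chunk_size]
        chunk_y = targets[start:start + chunk_size]
        hidden = [relu(matvec(W_up, x)) for x in chunk_x]

        # 1) Apply: every position in the chunk uses the fast weights
        #    produced by earlier chunks only.
        preds = [matvec(W_fast, h) for h in hidden]
        outputs.extend(preds)

        # 2) Update: one gradient step on the chunk-averaged loss
        #    0.5 * ||W_fast h - y||^2, evaluated at the pre-update weights.
        n = len(hidden)
        for h, p, y in zip(hidden, preds, chunk_y):
            err = [pi - yi for pi, yi in zip(p, y)]
            for i, e in enumerate(err):
                row = W_fast[i]
                for j, hj in enumerate(h):
                    row[j] -= lr * e * hj / n
    return outputs, W_fast


# Usage: a constant input with a fixed target; later chunks move toward it.
xs = [[1.0, 0.5]] * 8
ys = [[1.0, -1.0]] * 8
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
outs, adapted = chunkwise_fast_weight_update(xs, ys, identity, zeros)
print(outs[0], outs[4])
```

Because each chunk's outputs depend only on weights computed from earlier chunks, the per-position work inside a chunk is independent. Under the assumptions of this sketch, that independence is what would allow the positions within a chunk to be processed in parallel.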