In-Place Test-Time Training

Abstract

The static "train then deploy" paradigm fundamentally limits the ability of Large Language Models (LLMs) to adapt their weights dynamically in response to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers: architectural incompatibility, computational inefficiency, and fast-weight objectives that are misaligned with language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the next-token-prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, yields a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts of up to 128k tokens, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation studies provide further insight into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
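To make the chunk-wise fast-weight idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It treats the final (down) projection of a two-layer MLP as the fast weights and updates it after each chunk, so that an update only affects later chunks. The abstract does not specify the objective, so the squared-error loss against a per-position `targets` array is a placeholder assumption, as are the function name `in_place_ttt_mlp`, the ReLU activation, and the `chunk_size` and `lr` values.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def in_place_ttt_mlp(x, targets, w_up, w_down, chunk_size=4, lr=0.1):
    """Run an MLP over a sequence while updating its down projection per chunk.

    x:       (seq_len, d_model) inputs to the MLP block
    targets: (seq_len, d_out) hypothetical regression targets (an assumption;
             the paper's actual objective is not given in the abstract)
    w_up:    (d_model, d_hidden) slow weights, never modified
    w_down:  (d_hidden, d_out) initial value of the fast weights
    Returns (outputs, fast), where fast is the adapted copy of w_down.
    """
    fast = w_down.copy()  # adapt a copy; the pretrained matrix stays intact
    outputs = np.empty((x.shape[0], w_down.shape[1]))
    for start in range(0, x.shape[0], chunk_size):
        end = min(start + chunk_size, x.shape[0])
        h = relu(x[start:end] @ w_up)   # hidden activations for this chunk
        y = h @ fast                    # forward pass with the current fast weights
        outputs[start:end] = y
        # Gradient of 0.5 * mean squared error over the chunk w.r.t. fast.
        grad = h.T @ (y - targets[start:end]) / (end - start)
        fast -= lr * grad               # takes effect from the next chunk onward
    return outputs, fast
```

Because each chunk is computed before its own update is applied, the first chunk's outputs match those of the unmodified MLP, and setting `lr=0.0` recovers the static model exactly.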