Test-Time Training with KV Binding Is Secretly Linear Attention

Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
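To make the claimed equivalence concrete, the sketch below works through the simplest special case only; it is not the general construction the paper develops. The assumptions are illustrative choices, not details taken from the abstract: the inner model is a single linear map `W` initialized to zero, the inner loss is the dot-product loss `-<W k, v>`, and one gradient step of size `lr` is taken per token. The function names `ttt_linear` and `linear_attention` are hypothetical. Under these assumptions, the recurrent "train at test time" view and the unnormalized causal linear attention view produce the same outputs.

```python
def ttt_linear(keys, values, queries, lr=1.0):
    """Recurrent view: one gradient step per token on a linear fast weight W."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weights start at zero (assumed)
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Assumed inner loss: L(W) = -<W k, v>, so dL/dW = -v k^T.
        # A gradient-descent step therefore adds lr * v k^T to W.
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += lr * v[i] * k[j]
        # Read out with the updated weights: o_t = W_t q_t.
        outputs.append([sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


def linear_attention(keys, values, queries, lr=1.0):
    """Attention view: o_t = sum over i <= t of lr * (k_i . q_t) * v_i, no softmax."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        o = [0.0] * d_v
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = lr * sum(kj * qj for kj, qj in zip(k, q))
            for i in range(d_v):
                o[i] += score * v[i]
        outputs.append(o)
    return outputs
```

Calling both functions on the same keys, values, and queries should give matching outputs up to floating-point error. The recurrent form costs a fixed amount per token, while the attention form exposes the same computation as a sum over past tokens, which is what makes a fully parallel formulation possible in this toy setting.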
