TIDE: Every Layer Knows the Token Beneath the Context

Abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where parameter-limited models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, which are computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We establish, theoretically and empirically, that TIDE addresses the issues associated with single-injection token identity, and we show improved performance across multiple language modeling and downstream tasks.
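The injection mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the internals, so several details are assumptions made only for illustration: each MemoryBlock is modeled as a plain lookup table, the router's depth conditioning is modeled as one learnable logit vector per layer, the null bank is modeled as a single learnable vector initialized to zero, and the memory is injected additively before each layer. The names `EmbeddingMemorySketch`, `lookup`, `inject`, and `forward` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemorySketch:
    """K independent lookup tables plus one null bank (hypothetical layout)."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # blocks[k][token_id] -> context-free vector of length dim.
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # Null bank: a learnable vector, initialized to zero here (assumption).
        self.null_bank = [0.0] * dim
        # Depth conditioning (assumption): one logit vector per layer,
        # covering the K blocks plus the null bank.
        self.router_logits = [
            [rng.gauss(0.0, 1.0) for _ in range(num_blocks + 1)]
            for _ in range(num_layers)
        ]

    def lookup(self, token_ids):
        """Run once per sequence: K block vectors plus the null bank per position."""
        return [
            [block[tok] for block in self.blocks] + [self.null_bank]
            for tok in token_ids
        ]

    def inject(self, hidden, banks, layer):
        """Add the router-weighted mixture of banks to each hidden state."""
        weights = softmax(self.router_logits[layer])
        out = []
        for h_t, banks_t in zip(hidden, banks):
            mix = [
                sum(w * bank[i] for w, bank in zip(weights, banks_t))
                for i in range(self.dim)
            ]
            out.append([h + m for h, m in zip(h_t, mix)])
        return out


def forward(token_ids, input_embedding, memory, layers):
    """Standard single lookup at the input, then memory injection at every layer."""
    hidden = [list(input_embedding[tok]) for tok in token_ids]
    banks = memory.lookup(token_ids)  # one lookup, reused by every layer
    for layer_idx, layer_fn in enumerate(layers):
        hidden = memory.inject(hidden, banks, layer_idx)
        hidden = layer_fn(hidden)  # stand-in for the attention + MLP block
    return hidden
```

In this sketch, a layer whose router places all of its weight on the null bank receives no token-identity signal, so the model can fall back to standard transformer behavior at that depth. Whether the null bank in TIDE plays exactly this role is an assumption here.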
