TIDE: Every Layer Knows the Token Beneath the Context

Abstract

We revisit a universally accepted but under-examined design choice in every modern large language model (LLM): a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, because they receive only a small fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where models with limited parameters map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We establish, theoretically and empirically, that TIDE mitigates the issues associated with single-token identity injection and improves performance across multiple language modeling and downstream tasks.
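The injection mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not specify implementation details, so everything below is an assumption made for illustration: the class name `EmbeddingMemorySketch`, the choice to model each MemoryBlock as a plain lookup table, the use of one learnable logit vector per layer as the "depth-conditioned" router, a zero-initialised vector as the null bank, and additive injection into the hidden state. It is not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemorySketch:
    """Hypothetical stand-in for EmbeddingMemory (all details assumed)."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        # K independent MemoryBlocks, each assumed to be a vocab_size x dim table.
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # Null bank: assumed to be one learnable vector, initialised to zero,
        # which lets a layer route weight away from the token memory.
        self.null_bank = [0.0] * dim
        # Router: assumed to be one logit per bank (K blocks + null) per layer.
        self.router_logits = [[0.0] * (num_blocks + 1) for _ in range(num_layers)]

    def lookup(self, token_id):
        """Context-free vectors for a token; computed once, reused by all layers."""
        return [block[token_id] for block in self.blocks]

    def inject(self, hidden, memory_vectors, layer):
        """Add the router-weighted mix of banks to one token's hidden state."""
        weights = softmax(self.router_logits[layer])
        banks = memory_vectors + [self.null_bank]
        mixed = [
            sum(w * bank[i] for w, bank in zip(weights, banks))
            for i in range(len(hidden))
        ]
        return [h + m for h, m in zip(hidden, mixed)]


# Usage: look the token up once, then re-inject its identity at every layer.
memory = EmbeddingMemorySketch(vocab_size=10, dim=4, num_blocks=3, num_layers=2)
vectors = memory.lookup(token_id=7)
hidden = [0.0] * 4
for layer in range(2):
    # A real transformer block (attention + MLP) would update `hidden` here.
    hidden = memory.inject(hidden, vectors, layer)
```

In this sketch, a layer whose router puts all of its weight on the null bank leaves the hidden state unchanged, which is one plausible reading of why a null bank is useful: each depth can opt out of the token-identity signal.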
