
TIDE: Every Layer Knows the Token Beneath the Context

May 7, 2026
Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho
cs.AI

Abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary causes rare-token embeddings to be chronically under-trained, since they receive only a fraction of the cumulative gradient signal that common tokens do; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-shot token identity injection, and demonstrate improved performance across multiple language modeling and downstream tasks.
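The injection mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the tensor sizes, the single shared null vector, and the purely depth-conditioned (token-independent) router weights are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, K, LAYERS = 100, 16, 4, 6  # hypothetical sizes for illustration

# K independent MemoryBlocks: each maps a token index to a context-free vector.
memory_blocks = rng.normal(size=(K, VOCAB, DIM)) * 0.02
# Learnable null bank, simplified here to a single "inject nothing" vector.
null_bank = np.zeros((1, DIM))
# Depth-conditioned router logits: one logit per (layer, block), plus the null slot.
router_logits = rng.normal(size=(LAYERS, K + 1))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject(hidden, token_ids, layer):
    """Add the routed, context-free memory vector for each token at this depth."""
    mem = memory_blocks[:, token_ids, :]                    # (K, T, DIM), looked up once
    null = np.broadcast_to(null_bank, (1,) + mem.shape[1:]) # (1, T, DIM) null option
    candidates = np.concatenate([mem, null], axis=0)        # (K+1, T, DIM)
    w = softmax(router_logits[layer])                       # depth-conditioned mixture
    return hidden + np.einsum('k,ktd->td', w, candidates)

tokens = np.array([3, 17, 42])
h = np.zeros((len(tokens), DIM))
for layer in range(LAYERS):
    # In a real transformer this would sit alongside attention/MLP sublayers;
    # here we show only the per-layer re-injection of token identity.
    h = inject(h, tokens, layer)
```

The key contrast with a standard transformer is that `memory_blocks[:, token_ids, :]` is available at every depth, so later layers can recover token identity even if the residual stream has drifted; the null slot lets the router opt out of injection at depths where it is unhelpful.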