

Lexinvariant Language Models

May 24, 2023
Authors: Qian Huang, Eric Zelikman, Sarah Li Chen, Yuhuai Wu, Gregory Valiant, Percy Liang
cs.AI

Abstract

Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without any fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than the a priori identity of any token. To answer this, we study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM to converge to the true language model at a uniform rate that is polynomial in terms of the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of the lexinvariant language models: First, given text generated from a substitution cipher of English, it implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4X better accuracy over synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
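The abstract's construction, replacing the learned embedding table with per-sequence random Gaussian codes, can be sketched in a few lines. The following is a minimal illustration, not the authors' released implementation: the function name lexinvariant_embed, the PyTorch framing, and the 1/sqrt(d) scaling are assumptions for the sake of a runnable example; the only property taken from the paper is that a token id maps to the same Gaussian vector within a sequence but to independently resampled vectors across sequences.

```python
import torch

def lexinvariant_embed(token_ids: torch.LongTensor, d_model: int) -> torch.Tensor:
    """Embed token ids with per-sequence random Gaussian vectors.

    Within each sequence, every occurrence of a given token id gets the same
    freshly sampled vector; different sequences draw independent vectors, so
    no fixed token embedding table is learned. (Illustrative sketch only.)
    """
    batch, seq_len = token_ids.shape
    vocab_size = int(token_ids.max()) + 1  # only the ids actually present matter
    # One independent Gaussian codebook per sequence in the batch.
    # The 1/sqrt(d_model) scaling is an assumed normalization choice.
    codebooks = torch.randn(batch, vocab_size, d_model) / d_model ** 0.5
    # Gather: identical ids share a vector inside a sequence,
    # but each sequence uses its own codebook.
    index = token_ids.unsqueeze(-1).expand(-1, -1, d_model)
    return torch.gather(codebooks, 1, index)

# Usage: these embeddings would feed a standard Transformer in place of
# a learned embedding lookup.
ids = torch.randint(0, 100, (2, 16))
x = lexinvariant_embed(ids, d_model=64)
print(x.shape)  # torch.Size([2, 16, 64])
```

Because the codes are resampled every sequence, the model can only exploit co-occurrence and repetition structure in the context, which is exactly the lexinvariance property the paper studies.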