Lexinvariant Language Models

Abstract

Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, the meaning of a lexical symbol can also be determined, and even redefined, by its structural role in a long context. In this paper, we ask: can a language model be performant without any fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than on the a priori identity of any token. To answer this, we study lexinvariant language models, which are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that one can construct a lexinvariant LM that converges to the true language model at a uniform rate that is polynomial in the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within a sequence but to different representations across sequences. Empirically, we demonstrate that such a model can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of lexinvariant language models. First, given text generated from a substitution cipher of English, the model implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4x better accuracy on synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance, as well as potential practical applications.
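The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lexinvariant_embed`, the embedding dimension, and the choice of unit-variance Gaussian entries are assumptions made here for illustration, and only the Python standard library is used.

```python
import random

def lexinvariant_embed(token_ids, dim, rng):
    """Embed one sequence using a freshly sampled random Gaussian table.

    Hypothetical helper for illustration only. Each distinct token id
    draws one vector with N(0, 1) entries (unit variance is an assumption)
    the first time it appears; repeats within the same sequence reuse it.
    """
    table = {}  # rebuilt on every call, so nothing is shared across sequences
    vectors = []
    for tok in token_ids:
        if tok not in table:
            table[tok] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vectors.append(table[tok])
    return vectors

rng = random.Random(0)
seq = [7, 3, 7, 9]

# Two independent encodings of the same token sequence.
first = lexinvariant_embed(seq, dim=8, rng=rng)
second = lexinvariant_embed(seq, dim=8, rng=rng)
```

Here `first[0]` and `first[2]` are the same vector because token 7 repeats within the sequence, while `first[0]` and `second[0]` differ because the table is resampled for each sequence. A model consuming these vectors can therefore only exploit repetition and co-occurrence patterns, not token identity.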
