Lexinvariant Language Models
May 24, 2023
Authors: Qian Huang, Eric Zelikman, Sarah Li Chen, Yuhuai Wu, Gregory Valiant, Percy Liang
cs.AI
Abstract
Token embeddings, a mapping from discrete lexical symbols to continuous
vectors, are at the heart of any language model (LM). However, lexical symbol
meanings can also be determined and even redefined by their structural role in
a long context. In this paper, we ask: is it possible for a language model to
be performant without any fixed token embeddings? Such a language model
would have to rely entirely on the co-occurrence and repetition of tokens in the
context rather than the a priori identity of any token. To answer
this, we study lexinvariant language models that are invariant to
lexical symbols and therefore do not need fixed token embeddings in practice.
First, we prove that we can construct a lexinvariant LM to converge to the true
language model at a uniform rate that is polynomial in terms of the context
length, with a constant factor that is sublinear in the vocabulary size.
Second, to build a lexinvariant LM, we simply encode tokens using random
Gaussian vectors, such that each token maps to the same representation within
each sequence but different representations across sequences. Empirically, we
demonstrate that it can indeed attain perplexity comparable to that of a
standard language model, given a sufficiently long context. We further explore
two properties of the lexinvariant language models: First, given text generated
from a substitution cipher of English, it implicitly implements Bayesian
in-context deciphering and infers the mapping to the underlying real tokens
with high accuracy. Second, it attains on average 4X better accuracy than a
standard language model on synthetic in-context reasoning tasks. Finally, we
discuss regularizing standard language
models towards lexinvariance and potential practical applications.
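The per-sequence random Gaussian encoding described in the abstract can be sketched roughly as follows (a minimal illustration, not the authors' released implementation; the function name, tensor shapes, and the scaling by the embedding dimension are assumptions): a fresh Gaussian embedding table is drawn for every sequence, so a token keeps the same vector within a sequence but receives a different one in each new sequence, removing any fixed token identity.

```python
import torch


def lexinvariant_embed(token_ids: torch.Tensor, vocab_size: int, embed_dim: int) -> torch.Tensor:
    """Map tokens to random Gaussian vectors drawn independently per sequence.

    token_ids: (batch, seq_len) integer tensor.
    Returns: (batch, seq_len, embed_dim) float tensor in which the same token id
    gets the same vector within a sequence but a different vector across sequences.
    """
    batch, _ = token_ids.shape
    # One random embedding table per sequence in the batch: (batch, vocab_size, embed_dim).
    # The 1/sqrt(embed_dim) scale is an assumption to keep vector norms roughly unit-sized.
    per_seq_tables = torch.randn(batch, vocab_size, embed_dim) / embed_dim ** 0.5
    # Gather each sequence's tokens from its own random table.
    return torch.gather(
        per_seq_tables,
        dim=1,
        index=token_ids.unsqueeze(-1).expand(-1, -1, embed_dim),
    )


# Usage sketch: the resulting embeddings would replace the fixed embedding layer
# and feed into an otherwise standard Transformer language model.
ids = torch.randint(0, 100, (2, 8))
x = lexinvariant_embed(ids, vocab_size=100, embed_dim=16)  # shape (2, 8, 16)
```

Because the tables are resampled per sequence, the model can only infer token meaning from in-context co-occurrence and repetition, which is the lexinvariance property the paper studies.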