Lexinvariant Language Models

Abstract

Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, the meaning of a lexical symbol can also be determined, and even redefined, by its structural role in a long context. In this paper, we ask: can a language model be performant without any fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than the a priori identity of any token. To answer this, we study lexinvariant language models, which are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM that converges to the true language model at a uniform rate that is polynomial in the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but to different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of lexinvariant language models. First, given text generated from a substitution cipher of English, the model implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4x better accuracy on synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
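
The per-sequence random encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lexinvariant_embed`, the use of NumPy, and the unit-variance Gaussian are assumptions made here for illustration. The sketch only shows the stated property, that a token keeps one representation within a sequence and receives a fresh one in every new sequence.

```python
import numpy as np


def lexinvariant_embed(token_ids, dim, rng):
    """Embed one sequence with a fresh random Gaussian vector per distinct token.

    Hypothetical helper: the name, signature, and unit variance are
    illustrative assumptions, not taken from the paper.
    """
    token_ids = np.asarray(token_ids)
    # Distinct tokens in this sequence, plus each position's index into them.
    unique_tokens, inverse = np.unique(token_ids, return_inverse=True)
    # Sample a new table for this sequence only; nothing is learned or stored.
    table = rng.standard_normal((len(unique_tokens), dim))
    # Look up each position, so repeated tokens share the same row.
    return table[inverse.reshape(-1)]


rng = np.random.default_rng(0)
sequence = [5, 9, 5, 2]  # token 5 appears at positions 0 and 2

first = lexinvariant_embed(sequence, dim=8, rng=rng)
second = lexinvariant_embed(sequence, dim=8, rng=rng)

# Within one sequence, both occurrences of token 5 share a vector.
print(np.array_equal(first[0], first[2]))
# Across sequences, token 5 is re-sampled, so the vectors differ.
print(np.allclose(first[0], second[0]))
```

Because the table is re-sampled for every sequence, a model consuming these vectors can only use which positions hold the same token, not which vocabulary entry that token is. This is why relabeling the vocabulary, as a substitution cipher does, leaves the information available to the model unchanged.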

