Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Abstract

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models in which the embedding layer is entirely frozen, with vectors derived not from data but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
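To make the idea of frozen, visually derived embeddings concrete, the following is a minimal sketch and not the authors' pipeline. The abstract does not describe how glyphs are rendered or projected, so the tiny hand-drawn 5x3 bitmaps, the names `GLYPHS`, `glyph_to_vector`, `FROZEN_TABLE`, `embed`, and `cosine`, and the unit-length normalization are all assumptions made for illustration.

```python
import math

# Hypothetical 5x3 bitmaps standing in for rendered Unicode glyphs.
# A real system would rasterize each code point with a font instead.
GLYPHS = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def glyph_to_vector(rows):
    """Flatten a bitmap into a unit-length vector of pixel intensities."""
    pixels = [float(ch) for row in rows for ch in row]
    norm = math.sqrt(sum(p * p for p in pixels))
    return tuple(p / norm for p in pixels)


# The vectors are computed once, before any training, from shape alone.
# Tuples are immutable, so no later step can update them in place.
FROZEN_TABLE = {ch: glyph_to_vector(rows) for ch, rows in GLYPHS.items()}


def embed(text):
    """Look up the fixed visual vector for each character of the input."""
    return [FROZEN_TABLE[ch] for ch in text]


def cosine(u, v):
    """Cosine similarity; the inputs are already unit length."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    vectors = embed("LIT")
    print(len(vectors), len(vectors[0]))
    # Similarity here reflects visual overlap between glyphs, not meaning.
    print(cosine(FROZEN_TABLE["I"], FROZEN_TABLE["T"]))
    print(cosine(FROZEN_TABLE["I"], FROZEN_TABLE["L"]))
```

In this toy table, "I" is closer to "T" than to "L" only because the shapes share more pixels, which is the sense in which such vectors are non-semantic. In a framework such as PyTorch, a precomputed table like this could be loaded with `torch.nn.Embedding.from_pretrained(weights, freeze=True)` so that it receives no gradient updates. Whether the authors do exactly this is not stated in the abstract.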
