Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Abstract

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models in which the embedding layer is entirely frozen, with vectors derived not from data but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings remain fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer that we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
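To make the idea of a frozen, glyph-derived embedding table concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how glyphs are rendered or what dimensionality is used, so the 5x5 bitmaps, the `GLYPHS` table, and the helper names `glyph_to_vector`, `embed`, and `cosine` are all hypothetical stand-ins chosen for illustration.

```python
import math

# Hypothetical 5x5 bitmaps standing in for rendered Unicode glyphs.
# Assumption: a real pipeline would rasterize each glyph with a font renderer.
GLYPHS = {
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def glyph_to_vector(rows):
    """Flatten a bitmap into a unit-length vector of pixel intensities."""
    pixels = [1.0 if ch == "#" else 0.0 for row in rows for ch in row]
    norm = math.sqrt(sum(p * p for p in pixels)) or 1.0
    return tuple(p / norm for p in pixels)


# Computed once before training; tuples keep the table immutable ("frozen").
FROZEN_EMBEDDINGS = {ch: glyph_to_vector(rows) for ch, rows in GLYPHS.items()}


def embed(text):
    """Look up the fixed visual vector for each character in the text."""
    return [FROZEN_EMBEDDINGS[ch] for ch in text]


def cosine(u, v):
    """Cosine similarity; inputs are already unit-length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    vectors = embed("LIT")
    print(len(vectors), len(vectors[0]))
    # Similarity here reflects shared pixels, not shared meaning.
    print(cosine(FROZEN_EMBEDDINGS["I"], FROZEN_EMBEDDINGS["T"]))
    print(cosine(FROZEN_EMBEDDINGS["I"], FROZEN_EMBEDDINGS["L"]))
```

In this toy table, vector similarity tracks visual overlap between glyphs only, which is the sense in which such embeddings are non-semantic. In a framework such as PyTorch, a table like this could be loaded with `torch.nn.Embedding.from_pretrained(weights, freeze=True)` so that it receives no gradient updates; whether the authors do so is not stated in the abstract.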
